Re: [zfs-discuss] ZFS monitoring

2013-02-12 Thread Pawel Jakub Dawidek
On Mon, Feb 11, 2013 at 05:39:27PM +0100, Jim Klimov wrote:
 On 2013-02-11 17:14, Borja Marcos wrote:
 
  On Feb 11, 2013, at 4:56 PM, Tim Cook wrote:
 
  The zpool iostat output has all sorts of statistics I think would be 
  useful/interesting to record over time.
 
 
  Yes, thanks :) I think I will add them, I just started with the esoteric 
  ones.
 
  Anyway, still there's no better way to read it than running zpool iostat 
  and parsing the output, right?
 
 
 I believe, in this case you'd have to run it as a continuous process
 and parse the outputs after the first one (overall uptime stat, IIRC).
 Also note that on problems with ZFS engine itself, zpool may lock up
 and thus halt your program - so have it ready to abort an outstanding
 statistics read after a timeout and perhaps log an error.
 
 And if pools are imported-exported during work, the zpool iostat
 output changes dynamically, so you basically need to parse its text
 structure every time.
 
 The zpool iostat -v might be even more interesting though, as it lets
 you see per-vdev statistics and perhaps notice imbalances, etc...
 
 All that said, I don't know if this data isn't also available as some
 set of kstats - that would probably be a lot better for your cause.
 Inspect the zpool source to see where it gets its numbers from...
 and perhaps make and RTI relevant kstats, if they aren't yet there ;)
 
 On the other hand, I am not certain how Solaris-based kstats interact
 or correspond to structures in FreeBSD (or Linux for that matter)?..

I made the kstat data available on FreeBSD via the 'kstat' sysctl tree:

# sysctl kstat
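
For example, the ARC statistics are exported under the kstat.zfs.misc.arcstats node (a quick sketch; exactly which other kstats show up there depends on what the port exports):

# sysctl kstat.zfs.misc.arcstats | head
# sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses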

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://tupytaj.pl




Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-19 Thread Pawel Jakub Dawidek
On Mon, Dec 19, 2011 at 10:18:05AM +, Darren J Moffat wrote:
 On 12/18/11 11:52, Pawel Jakub Dawidek wrote:
  On Thu, Dec 15, 2011 at 04:39:07PM -0700, Cindy Swearingen wrote:
  Hi Anon,
 
  The disk that you attach to the root pool will need an SMI label
  and a slice 0.
 
  The syntax to attach a disk to create a mirrored root pool
  is like this, for example:
 
  # zpool attach rpool c1t0d0s0 c1t1d0s0
 
  BTW. Can you, Cindy, or someone else reveal why one cannot boot from
  RAIDZ on Solaris? Is this because Solaris is using GRUB and RAIDZ code
  would have to be licensed under GPL as the rest of the boot code?
 
  I'm asking, because I see no technical problems with this functionality.
  Booting off of RAIDZ (even RAIDZ3) and also from multi-top-level-vdev
  pools works just fine on FreeBSD for a long time now. Not being forced
  to have dedicated pool just for the root if you happen to have more than
  two disks in you box is very convenient.
 
 For those of us not familiar with how FreeBSD is installed and boots can 
 you explain how boot works (ie do you use GRUB at all and if so which 
 version and where the early boot ZFS code is).

We don't use GRUB, no. We use three stages for booting. Stage 0 is
basically a very simple 512-byte MBR boot loader installed at the
beginning of the disk, whose only job is to launch the stage 1 boot
loader. Stage 1 is where we interpret the full ZFS (or UFS) structures
and read real files. When you use GPT, there is a dedicated partition
(of type freebsd-boot) where you install the gptzfsboot binary: stage 0
looks for a GPT partition of type freebsd-boot, loads it and starts the
code in there. This partition doesn't contain a file system, of course;
the stage 0 loader (boot0) is too simple to read any file system.
gptzfsboot is where we handle all ZFS-related operations; it is mostly
used to find the root dataset and load zfsloader from it. zfsloader is
the last stage in booting. It shares the same ZFS-related code as
gptzfsboot (but compiled into a separate binary); it loads the kernel
and its modules and starts the kernel.
zfsloader is stored in the /boot/ directory on the root dataset.
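
For reference, this is roughly how such a GPT layout is usually created on FreeBSD; the disk name ada0 and the partition sizes are assumptions, not taken from this thread:

# gpart create -s gpt ada0
# gpart add -t freebsd-boot -s 128k ada0
# gpart add -t freebsd-zfs ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

The last command writes the protective MBR (the stage 0 described above) and installs gptzfsboot into the freebsd-boot partition at index 1.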

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] Can I create a mirror for a root rpool?

2011-12-18 Thread Pawel Jakub Dawidek
On Sun, Dec 18, 2011 at 07:24:27PM +0700, Fajar A. Nugraha wrote:
 On Sun, Dec 18, 2011 at 6:52 PM, Pawel Jakub Dawidek p...@freebsd.org wrote:
  BTW. Can you, Cindy, or someone else reveal why one cannot boot from
  RAIDZ on Solaris? Is this because Solaris is using GRUB and RAIDZ code
  would have to be licensed under GPL as the rest of the boot code?
 
  I'm asking, because I see no technical problems with this functionality.
  Booting off of RAIDZ (even RAIDZ3) and also from multi-top-level-vdev
  pools works just fine on FreeBSD for a long time now.
 
 Really? How do they do that?

Well, the boot code has access to all the disks, so it is just a matter of
being able to interpret the data, which our boot code can do.

 In Linux, you can boot from disks with GPT label with grub2, and have
 / on raidz, but only as long as /boot is on grub2-compatible fs
 (e.g. single or mirrored zfs pool, ext4, etc).

This is not the same. On FreeBSD everything, including the root file system
and the boot directory, can be on RAIDZ.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-13 Thread Pawel Jakub Dawidek
On Mon, Dec 12, 2011 at 08:30:56PM +0400, Jim Klimov wrote:
 2011-12-12 19:03, Pawel Jakub Dawidek wrote:
  As I said, ZFS reading path involves no dedup code. No at all.
 
 I am not sure if we contradicted each other ;)
 
 What I meant was that the ZFS reading path involves reading
 logical data blocks at some point, consulting the cache(s)
 if the block is already cached (and up-to-date). These blocks
 are addressed by some unique ID, and now with dedup there are
 several pointers to same block.
 
 So, basically, reading a file involves reading ZFS metadata,
 determining data block IDs, fetching them from disk or cache.
 
 Indeed, this does not need to be dedup-aware; but if the other
 chain of metadata blocks points to same data or metadata blocks
 which were already cached (for whatever reason, not limited to
 dedup) - this is where the read-speed boost appears.
 Likewise, if some blocks are not cached, such as metadata
 needed to determine the second file's block IDs, this incurs
 disk IOs and may decrease overall speed.

Ok, you are right, although in this test I believe the metadata of the
other file was already prefetched. I'm using this box for something else
now, so I can't retest, but the procedure is so easy that everyone is
welcome to try it :)

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-12 Thread Pawel Jakub Dawidek
On Sun, Dec 11, 2011 at 04:04:37PM +0400, Jim Klimov wrote:
 I would not be surprised to see that there is some disk IO
 adding delays for the second case (read of a deduped file
 clone), because you still have to determine references
 to this second file's blocks, and another path of on-disk
 blocks might lead to it from a separate inode in a separate
 dataset (or I might be wrong). Reading this second path of
 pointers to the same cached data blocks might decrease speed
 a little.

As I said, the ZFS read path involves no dedup code. None at all.
One proof is that you can boot from ZFS with dedup turned on even though
the ZFS boot code contains no dedup code whatsoever. Another proof is
the ZFS source code itself.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-10 Thread Pawel Jakub Dawidek
On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:
 Unfortunately the answer is no. Neither the L1 nor the L2 cache is dedup aware.
 
 The only vendor I know that can do this is NetApp

And you really work at Oracle? :)

The answer is definitely yes. The ARC caches on-disk blocks and dedup just
references those blocks. When you read, the dedup code is not involved at all.
Let me show you with a simple test:

Create a file (dedup is on):

# dd if=/dev/random of=/foo/a bs=1m count=1024

Copy this file so that it is deduped:

# dd if=/foo/a of=/foo/b bs=1m

Export the pool so all cache is removed and reimport it:

# zpool export foo
# zpool import foo

Now let's read one file:

# dd if=/foo/a of=/dev/null bs=1m
1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)

We read file 'a' and all its blocks are in cache now. The 'b' file
shares all the same blocks, so if ARC caches blocks only once, reading
'b' should be much faster:

# dd if=/foo/b of=/dev/null bs=1m
1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)

Now look at that: 'b' was read 12.5 times faster than 'a', with no disk
activity. Magic? :)

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] about btrfs and zfs

2011-10-19 Thread Pawel Jakub Dawidek
On Wed, Oct 19, 2011 at 08:40:59AM +1100, Peter Jeremy wrote:
 fsck verifies the logical consistency of a filesystem.  For UFS, this
 includes: used data blocks are allocated to exactly one file,
 directory entries point to valid inodes, allocated inodes have at
 least one link, the number of links in an inode exactly matches the
 number of directory entries pointing to that inode, directories form a
 single tree without loops, file sizes are consistent with the number
 of allocated blocks, unallocated data/inodes blocks are in the
 relevant free bitmaps, redundant superblock data is consistent.  It
 can't verify data.

Well said. I'd add that people who insist on ZFS having a fsck are
missing the whole point of ZFS's transactional model and copy-on-write
design.

Fsck can only fix known inconsistencies in file system structures.
Because there is no atomicity of operations in UFS and other traditional
file systems, it is possible that when you remove a file, your system
crashes between removing the directory entry and freeing the inode or
blocks. This is expected with UFS; that's why fsck exists: to verify
that no such thing happened.

In ZFS, on the other hand, there are no inconsistencies like that. If all
blocks match their checksums and you still find a directory loop or
something similar, it is a bug in ZFS, not an expected inconsistency. It
should be fixed in ZFS, not worked around with some fsck for ZFS.
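
The ZFS counterpart to an integrity check is a scrub, which re-reads every allocated block, verifies it against its checksum and repairs it from redundancy where possible. A minimal example, with the pool name assumed:

# zpool scrub tank
# zpool status -v tank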

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] about btrfs and zfs

2011-10-19 Thread Pawel Jakub Dawidek
On Wed, Oct 19, 2011 at 10:13:56AM -0400, David Magda wrote:
 On Wed, October 19, 2011 08:15, Pawel Jakub Dawidek wrote:
 
  Fsck can only fix known file system inconsistencies in file system
  structures. Because there is no atomicity of operations in UFS and other
  file systems it is possible that when you remove a file, your system can
  crash between removing directory entry and freeing inode or blocks.
  This is expected with UFS, that's why there is fsck to verify that no
  such thing happend.
 
 Slightly OT, but this non-atomic delay between meta-data updates and
 writes to the disk is exploited by soft updates with FreeBSD's UFS:
 
 http://www.freebsd.org/doc/en/books/handbook/configtuning-disk.html#SOFT-UPDATES
 
 It may be of some interest to the file system geeks on the list.

Well, soft updates, thanks to careful ordering of operations, allow
mounting the file system even in an inconsistent state and running fsck
in the background, as the only possible inconsistencies are resource
leaks: a directory entry will never point at an unallocated inode and an
inode will never point at an unallocated block, etc. This is still not
atomic.

With recent versions of FreeBSD, soft updates were extended to journal
those resource leaks, so a background fsck is not needed anymore.
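
On those FreeBSD versions the journaling variant (SU+J) is enabled with tunefs on an unmounted file system; a sketch, with the device name being an assumption:

# tunefs -n enable /dev/ada0p2
# tunefs -j enable /dev/ada0p2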

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] zfs send/receive and ashift

2011-07-27 Thread Pawel Jakub Dawidek
On Tue, Jul 26, 2011 at 03:28:10AM -0700, Fred Liu wrote:
 
  
  The ZFS Send stream is at the DMU layer at this layer the data is
  uncompress and decrypted - ie exactly how the application wants it.
  
 
 Even the data compressed/encrypted by ZFS will be decrypted? If it is true, 
 will it be any CPU overhead?
 And ZFS send/receive tunneled by ssh becomes the only way to encrypt the data 
 transmission?

Even if zfs send/recv worked with encrypted and compressed data, you
would still need some secure tunneling. Storage encryption is not the
same as network traffic encryption.
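
In practice that usually means piping the stream over ssh; a sketch with hypothetical dataset, snapshot and host names:

# zfs snapshot tank/data@backup1
# zfs send tank/data@backup1 | ssh user@backuphost zfs receive -d backuppool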

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] ZFS for Linux?

2011-06-15 Thread Pawel Jakub Dawidek
On Tue, Jun 14, 2011 at 04:15:17PM +0400, Jim Klimov wrote:
 Hello,
 
   A college friend of mine is using Debian Linux on his desktop,
 and wondered if he could tap into ZFS goodness without adding
 another server in his small quiet apartment or changing the
 desktop OS. According to his research, there are some kernel
 modules for Debian which implement ZFS, or a FUSE variant.
 
   Can anyone comment how stable and functional these are?
 Performance is a secondary issue, as long as it does not
 lead to system crashes due to timeouts, etc. ;)

If you would like to stay with Debian, you can try Debian GNU/kFreeBSD,
which is the Debian userland on top of the FreeBSD kernel and thus includes ZFS.

http://www.debian.org/ports/kfreebsd-gnu/

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] Disk replacement need to scan full pool ?

2011-06-15 Thread Pawel Jakub Dawidek
On Tue, Jun 14, 2011 at 11:49:56AM -0700, Bill Sommerfeld wrote:
 On 06/14/11 04:15, Rasmus Fauske wrote:
  I want to replace some slow consumer drives with new edc re4 ones but
  when I do a replace it needs to scan the full pool and not only that
  disk set (or just the old drive)
  
  Is this normal ? (the speed is always slow in the start so thats not
  what I am wondering about, but that it needs to scan all of my 18.7T to
  replace one drive)
 
 This is normal.  The resilver is not reading all data blocks; it's
 reading all of the metadata blocks which contain one or more block
 pointers, which is the only way to find all the allocated data (and in
 the case of raidz, know precisely how it's spread and encoded across the
 members of the vdev).  And it's reading all the data blocks needed to
 reconstruct the disk to be replaced.

Maybe it would be faster to just offline this one disk, use dd(1) to
copy the entire disk content, disconnect the old disk and online the new
one. I'm not sure how well this will work on Solaris, as the new disk's
serial number won't match the one in the metadata, but it will surely
work on FreeBSD.
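
Roughly like this, assuming the old disk is da3 and the new one temporarily shows up as da4 (device and pool names are assumptions):

# zpool offline tank da3
# dd if=/dev/da3 of=/dev/da4 bs=1m
(swap the disks so the new one appears where the old one was)
# zpool online tank da3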

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] best migration path from Solaris 10

2011-03-23 Thread Pawel Jakub Dawidek
On Sun, Mar 20, 2011 at 01:54:54PM +0700, Fajar A. Nugraha wrote:
 On Sun, Mar 20, 2011 at 4:05 AM, Pawel Jakub Dawidek p...@freebsd.org wrote:
  On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote:
  Newer versions of FreeBSD have newer ZFS code.
 
  Yes, we are at v28 at this point (the lastest open-source version).
 
  That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...]
 
  That's actually not true. There are more FreeBSD committers working on
  ZFS than on UFS.
 
 How is the performance of ZFS under FreeBSD? Is it comparable to that
 in Solaris, or still slower due to some needed compatibility layer?

This compatibility layer is just a bunch of ugly defines, etc. to allow
for fewer code modifications. It introduces no overhead.

I made a performance comparison between FreeBSD 9 with ZFSv28 and Solaris
11 Express, but I don't think the Solaris license allows me to publish the
results. Believe me, though, the results were very surprising :)

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] best migration path from Solaris 10

2011-03-19 Thread Pawel Jakub Dawidek
On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote:
 Newer versions of FreeBSD have newer ZFS code.

Yes, we are at v28 at this point (the latest open-source version).

 That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...]

That's actually not true. There are more FreeBSD committers working on
ZFS than on UFS.

 There are vendors who offer NexentaStor on hardware with full commercial
 support from a single vendor (granted they get backline support from
 Nexenta, but do you think ixSystems engineers personally fix bugs in
 FreeBSD?) [...]

iXsystems works very closely with the FreeBSD project. They hire or
contract quite a few FreeBSD committers (FYI, I'm not one of them), so
yes, they are definitely in a position to fix bugs in FreeBSD, as well as
develop new features, and they do.

Just wanted to clarify a few points :)

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
FreeBSD committer http://www.FreeBSD.org
Am I Evil? Yes, I Am! http://yomoli.com




Re: [zfs-discuss] ZFS and TRIM

2011-02-04 Thread Pawel Jakub Dawidek
On Sat, Jan 29, 2011 at 11:31:59AM -0500, Edward Ned Harvey wrote:
 What is the status of ZFS support for TRIM?
[...]

I've no idea, but because I've wanted to add such support to FreeBSD/ZFS
for a while now, I'll share my thoughts.

The problem is where to put those operations. ZFS internally has the
ZIO_TYPE_FREE request, which represents exactly what we need - an offset
and size to free. It would be best to just pass those requests directly
to the VDEVs, but we can't do that: there might be a transaction group
that never gets committed because of a power failure, and we would have
TRIMed blocks that we still need after boot.
Ok, maybe we could just make such operations part of the transaction
group? No, we can't do that either. If we start committing a transaction
group and execute the TRIM operations, we may still hit a power failure,
and TRIM operations on old blocks cannot be undone, so we would be left
with invalid data.

So why not move the TRIM operations to the next transaction group? That's
doable, although we still need to be careful not to TRIM blocks that
were freed in the previous transaction group but are reallocated in the
current one (or, if we do TRIM them, to TRIM first and then write).
Unfortunately, we don't want to TRIM blocks immediately anyway. Take into
account disks that lie about the cache flush operation: because of them,
ZFS tries to keep freed blocks from the last few transaction groups
around, so you can forcibly rewind to one of the previous txgs if such
corruption occurs.

My initial idea was to implement 100% reliable TRIM, so that I could
implement secure delete using it, e.g. if ZFS is placed on top of a disk
encryption layer, I can implement TRIM in that layer as 'overwrite the
given range with random data'. Making TRIM 100% reliable will be very
hard, IMHO. But in most cases we don't need TRIM to be so perfect. My
current idea is to delay the TRIM operation for some number of
transaction groups. For example, if a block is freed in txg=5, I'll send
TRIM for it after txg=15 (if it wasn't reallocated in the meantime). This
is fine if we crash before we get to txg=15, because the only side effect
is that the next write to this range might be a little slower.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-15 Thread Pawel Jakub Dawidek
On Fri, Jan 14, 2011 at 11:32:58AM -0800, Peter Taps wrote:
 Ed,
 
 Thank you for sharing the calculations. In lay terms, for Sha256, how many 
 blocks of data would be needed to have one collision?
 
 Assuming each block is 4K in size, we can probably calculate the final data 
 size beyond which the collision may occur. This would enable us to make the 
 following statement:
 
 With Sha256, you need verification to be turned on only if you are dealing 
 with more than xxxT of data.

Except that this is the wrong question to ask. The question you should ask
is: "How many blocks of data do I need before the collision probability reaches X%?"

 Also, another related question. Why 256 bits was chosen and not 128 bits or 
 512 bits? I guess Sha512 may be an overkill. In your formula, how many blocks 
 of data would be needed to have one collision using Sha128?

There is no SHA128, and SHA512's hash is too long. Currently the maximum
hash ZFS can handle is 32 bytes (256 bits). Spending another 32 bytes per
block without improving anything in practice probably wasn't worth it.

BTW, as for SHA512 being slower: it looks like it depends on the
implementation, or SHA512 really is faster to compute on a 64-bit CPU.
On my laptop OpenSSL computes SHA256 55% _slower_ than SHA512.
If this is a general rule, maybe it would be worth considering using
SHA512 truncated to 256 bits to get more speed.
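
An easy way to check this on a particular machine is OpenSSL's built-in benchmark, which is presumably where the numbers above came from:

# openssl speed sha256 sha512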

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-10 Thread Pawel Jakub Dawidek
On Sat, Jan 08, 2011 at 12:59:17PM -0500, Edward Ned Harvey wrote:
 Has anybody measured the cost of enabling or disabling verification?

Of course, there is no easy answer :)

Let me explain how verification works exactly first.

You try to write a block. You see that the block is already in the dedup
table (it is already referenced). You read the block (maybe it is in the
ARC or in the L2ARC). You compare the block you read with the one you
want to write.

Based on the above:
1. If you have dedup on, but your blocks are not deduplicable at all,
   you will pay no price for verification, as there will be no need to
   compare anything.
2. If your data is highly deduplicable, you will verify often. It then
   depends on whether the data you need to read fits into your ARC/L2ARC.
   If it can be found in the ARC, the impact will be small.
   If your pool is very large and you can't count on the ARC's help, each
   write will be turned into a read.

Also note an interesting property of dedup: if your data is highly
deduplicable, you can actually improve performance by avoiding data
writes (and just increasing the reference count).
Let me show you three degenerate tests to compare the options.
I'm writing 64GB of zeros to a pool with dedup turned off, with dedup turned on,
and with dedup+verification turned on (I use the SHA256 checksum everywhere):

# zpool create -O checksum=sha256 tank ada{0,1,2,3}
# time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank'
      254,11 real         0,07 user        40,80 sys

# zpool create -O checksum=sha256 -O dedup=on tank ada{0,1,2,3}
# time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank'
      154,60 real         0,05 user        37,10 sys

# zpool create -O checksum=sha256 -O dedup=sha256,verify tank ada{0,1,2,3}
# time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank'
      173,43 real         0,02 user        38,41 sys

As you can see, in the second and third tests the data is of course in the ARC,
so the difference here comes only from the data comparison (no extra reads are
needed), and verification is 12% slower.

This is of course a silly test, but as you can see dedup (even with verification)
is much faster than the no-dedup case; then again, this data is highly deduplicable :)

# zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP       HEALTH  ALTROOT
tank   149G  8,58M   149G     0%  524288.00x  ONLINE  -

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-09 Thread Pawel Jakub Dawidek
On Fri, Jan 07, 2011 at 03:06:26PM -0800, Brandon High wrote:
 On Fri, Jan 7, 2011 at 11:33 AM, Robert Milkowski mi...@task.gda.pl wrote:
  end-up with the block A. Now if B is relatively common in your data set you
  have a relatively big impact on many files because of one corrupted block
  (additionally from a fs point of view this is a silent data corruption).
  Without dedup if you get a single block corrupted silently an impact usually
  will be relatively limited.
 
 A pool can be configured so that a dedup'd block will only be
 referenced a certain number of times. So if you write out 10,000
 identical blocks, it may be written 10 times with each duplicate
 referenced 1,000 times. The exact number is controlled by the
 dedupditto property for your pool, and you should set it as your risk
 tolerance allows.

Dedupditto doesn't work exactly that way. You can have at most 3 copies
of your block. The minimal dedupditto value is 100. The first copy is
created on the first write, the second copy is created at dedupditto
references, and the third copy at 'dedupditto * dedupditto' references.
So once you reach dedupditto-squared references to your block, ZFS will
have created three physical copies, not earlier and never more than three.
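
For completeness, the threshold is a per-pool property; a sketch with an assumed pool name:

# zpool set dedupditto=100 tank
# zpool get dedupditto tank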

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Pawel Jakub Dawidek
On Fri, Jan 07, 2011 at 07:33:53PM +, Robert Milkowski wrote:
  On 01/ 7/11 02:13 PM, David Magda wrote:
 
 Given the above: most people are content enough to trust Fletcher to not
 have data corruption, but are worried about SHA-256 giving 'data
 corruption' when it comes de-dupe? The entire rest of the computing world
 is content to live with 10^-15 (for SAS disks), and yet one wouldn't be
 prepared to have 10^-30 (or better) for dedupe?
 
 
 I think you do not entirely understand the problem.
 Let's say two different blocks A and B have the same sha256 checksum, A 
 is already stored in a pool, B is being written. Without verify and 
 dedup enabled B won't be written. Next time you ask for block B you will 
 actually end-up with the block A. Now if B is relatively common in your 
 data set you have a relatively big impact on many files because of one 
 corrupted block (additionally from a fs point of view this is a silent 
 data corruption). [...]

All true, and that's why verification was mandatory for fletcher, which is
not a cryptographically strong hash. Until SHA256 is broken, spending
power on verification is just a waste of resources, which isn't green :)
Once SHA256 is broken, verification can be turned on.

 [...] Without dedup if you get a single block corrupted 
 silently an impact usually will be relatively limited.

Except when the corruption happens on write, not read, i.e. you write data,
it is corrupted on the fly, but the corrupted version still matches the
fletcher checksum (the default now). From then on, every read of this block
will return silently corrupted data.

 Now what if block B is a meta-data block?

Metadata is not deduplicated.

 The point is that a potential impact of a hash collision is much bigger 
 than a single silent data corruption to a block, not to mention that 
 dedup or not all the other possible cases of data corruption are there 
 anyway, adding yet another one might or might not be acceptable.

I'm more of the opinion that it was a mistake that the verification feature
wasn't removed along with the fletcher-for-dedup removal. It is good to be
able to turn verification on if/once SHA256 is broken - that's the
only reason I'd leave it - but I somehow feel that there is a bigger
chance of corrupting your data because of the extra code complexity
that comes with verification than because of a SHA256 collision.
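
If that day ever comes, turning verification on is a one-line property change (dataset name assumed):

# zfs set dedup=sha256,verify tank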

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Recover data from detached ZFS mirror

2010-11-26 Thread Pawel Jakub Dawidek
On Thu, Nov 25, 2010 at 12:45:16AM -0800, maciej kaminski wrote:
 I've detached disk from a mirrored zpool using zpool detach (not zpool 
 split) command. Is it possible to recover data from that disk? If yes, how? 
 (and how to make it bootable)

Take a look at this thread:

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg15620.html

Jeff Bonwick provided a tool to recover the ZFS label, which will allow you
to import such a detached vdev.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Pools inside pools

2010-09-22 Thread Pawel Jakub Dawidek
On Wed, Sep 22, 2010 at 02:06:27PM +, Markus Kovero wrote:
 Hi, I'm asking for opinions here, any possible disaster happening or 
 performance issues related in setup described below.
 Point being to create large pool and smaller pools within where you can 
 monitor easily iops and bandwidth usage without using dtrace or similar 
 techniques.
 
 1. Create pool
 
 # zpool create testpool mirror c1t1d0 c1t2d0
 
 2. Create volume inside a pool we just created
 
 # zfs create -V 500g testpool/testvolume
 
 3. Create pool from volume we just did
 
 # zpool create anotherpool /dev/zvol/dsk/testpool/testvolume
 
 After this, anotherpool can be monitored via zpool iostat nicely and 
 compression can be used in testpool to save resources without having 
 compression effect in anotherpool.
 
 zpool export/import seems to work, although flag -d needs to be used, are 
 there any caveats in this setup? How writes are handled?
 Is it safe to create pool consisting several ssd's and use volumes from it as 
 log-devices? Is it even supported?

Such a configuration was known to cause deadlocks. Even if it works now
(which I don't expect to be the case), it will cause your data to be
cached twice. The CPU utilization will also be much higher, etc.
All in all, I strongly recommend against such a setup.

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Hang on zpool import (dedup related)

2010-09-13 Thread Pawel Jakub Dawidek
On Sun, Sep 12, 2010 at 11:24:06AM -0700, Chris Murray wrote:
 Absolutely spot on George. The import with -N took seconds.
 
 Working on the assumption that esx_prod is the one with the problem, I bumped 
 that to the bottom of the list. Each mount was done in a second:
 
 # zfs mount zp
 # zfs mount zp/nfs
 # zfs mount zp/nfs/esx_dev
 # zfs mount zp/nfs/esx_hedgehog
 # zfs mount zp/nfs/esx_meerkat
 # zfs mount zp/nfs/esx_meerkat_dedup
 # zfs mount zp/nfs/esx_page
 # zfs mount zp/nfs/esx_skunk
 # zfs mount zp/nfs/esx_temp
 # zfs mount zp/nfs/esx_template
 
 And those directories have the content in them that I'd expect. Good!
 
 So now I try to mount esx_prod, and the influx of reads has started in   
 zpool iostat zp 1
 
 This is the filesystem with the issue, but what can I do now?

You could try to snapshot it (but keep it unmounted), then zfs send it
and zfs recv it to e.g. zp/foo. Use the -u option for zfs recv too, then try
to mount what you received.
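
A sketch of that sequence, assuming the problematic file system is zp/nfs/esx_prod and using a made-up snapshot name:

# zfs snapshot zp/nfs/esx_prod@rescue
# zfs send zp/nfs/esx_prod@rescue | zfs recv -u zp/foo
# zfs mount zp/foo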

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] zfs upgrade unmounts filesystems

2010-07-29 Thread Pawel Jakub Dawidek
On Thu, Jul 29, 2010 at 12:00:08PM -0600, Cindy Swearingen wrote:
 Hi Gary,
 
 I found a similar zfs upgrade failure with the device busy error, which
 I believe was caused by a file system mounted under another file system.
 
 If this is the cause, I will file a bug or find an existing one.
 
 The workaround is to unmount the nested file systems and upgrade them
 individually, like this:
 
 # zfs upgrade space/direct
 # zfs upgrade space/dcc

'zfs upgrade' unmounts the file system first, which makes it hard to upgrade,
for example, the root file system. The only workaround I found is to clone the
root file system (the clone is created with the most recent version), switch the
root file system to the newly created clone, reboot, upgrade the original root
file system, switch the root file system back, reboot, and destroy the clone.
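
A rough sketch of that dance, assuming a root dataset named rpool/root and using the pool's bootfs property to switch the boot dataset (the exact mechanism differs between platforms):

# zfs snapshot rpool/root@pre-upgrade
# zfs clone rpool/root@pre-upgrade rpool/root-tmp
# zpool set bootfs=rpool/root-tmp rpool
(reboot into the clone)
# zfs upgrade rpool/root
# zpool set bootfs=rpool/root rpool
(reboot into the original root)
# zfs destroy -R rpool/root@pre-upgrade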

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Pawel Jakub Dawidek
On Thu, May 06, 2010 at 11:28:37AM +0100, Robert Milkowski wrote:
 With the put back of:
 
 [PSARC/2010/108] zil synchronicity
 
 zfs datasets now have a new 'sync' property to control synchronous 
 behaviour.
 The zil_disable tunable to turn synchronous requests into asynchronous
 requests (disable the ZIL) has been removed. For systems that use that 
 switch on upgrade
 you will now see a message on booting:
 
   sorry, variable 'zil_disable' is not defined in the 'zfs' module
 
 Please update your system to use the new sync property.
 Here is a summary of the property:
 
 ---
 
 The options and semantics for the zfs sync property:
 
 sync=standard
This is the default option. Synchronous file system transactions
(fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
and then secondly all devices written are flushed to ensure
the data is stable (not cached by device controllers).
 
 sync=always
For the ultra-cautious, every file system transaction is
written and flushed to stable storage by system call return.
This obviously has a big performance penalty.
 
 sync=disabled
Synchronous requests are disabled.  File system transactions
only commit to stable storage on the next DMU transaction group
commit which can be many seconds.  This option gives the
highest performance, with no risk of corrupting the pool.
However, it is very dangerous as ZFS is ignoring the synchronous
 transaction
demands of applications such as databases or NFS.
Setting sync=disabled on the currently active root or /var
file system may result in out-of-spec behavior or application data
loss and increased vulnerability to replay attacks.
Administrators should only use this when these risks are understood.
 
 The property can be set when the dataset is created, or dynamically,
 and will take effect immediately.  To change the property, an
 administrator can use the standard 'zfs' command.  For example:
 
 # zfs create -o sync=disabled whirlpool/milek
 # zfs set sync=always whirlpool/perrin

I read that this property is not inherited and I can't see why.
If what I read is up-to-date, could you tell why?
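
For what it's worth, a quick way to see how the property propagates after setting it, reusing the pool name from the example above:

# zfs get -r sync whirlpool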

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...

2010-05-06 Thread Pawel Jakub Dawidek
On Thu, May 06, 2010 at 01:15:41PM +0100, Robert Milkowski wrote:
 On 06/05/2010 13:12, Robert Milkowski wrote:
 On 06/05/2010 12:24, Pawel Jakub Dawidek wrote:
 I read that this property is not inherited and I can't see why.
 If what I read is up-to-date, could you tell why?
 
 It is inherited. Sorry for the confusion but there was a discussion if 
 it should or should not be inherited, then we propose that it 
 shouldn't but it was changed again during a PSARC review that it should.
 
 And I did a copy'n'paste here.
 
 Again, sorry for the confusion.
 
 Well, actually I did copy'n'paste a proper page as it doesn't say 
 anything about inheritance.
 
 Nevertheless, yes it is inherited.

Yes, your e-mail didn't mention that, and I wanted to clarify whether what I
read in the PSARC case had changed or not. Thanks :)

-- 
Pawel Jakub Dawidek   http://www.wheelsystems.com
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Status/priority of 6761786

2009-08-31 Thread Pawel Jakub Dawidek
On Thu, Aug 27, 2009 at 01:37:11PM -0600, Dave wrote:
 Can anyone from Sun comment on the status/priority of bug ID 6761786? 
 Seems like this would be a very high priority bug, but it hasn't been 
 updated since Oct 2008.
 
 Has anyone else with thousands of volume snapshots experienced the hours 
 long import process?

It might not be directly ZFS's fault. I tried to reproduce this on FreeBSD
and I was able to import a pool with ~2000 ZVOLs and ~1 ZVOL snapshots
in a few minutes. Those were empty ZVOLs and empty snapshots, so keep that
in mind. All in all, creating /dev/ entries might be slow on Solaris, and
that's why you experience this behaviour when importing a ZFS pool with many
ZVOLs and many ZVOL snapshots (note that every ZVOL snapshot is a device
entry in /dev/zvol/, unlike file systems, where snapshots are mounted on
.zfs/snapshot/name lookup and not at import time).

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Need 1.5 TB drive size to use for array for testing

2009-08-22 Thread Pawel Jakub Dawidek
On Sat, Aug 22, 2009 at 12:00:42AM -0700, Jason Pfingstmann wrote:
 Thanks for the reply!
 
 The reason I'm not waiting until I have the disks is mostly because it will 
 take me several months to get the funds together and in the meantime, I need 
 the extra space 1 or 2 drives gets me.  Since the sparse files will only take 
 up the space in use, if I've migrated 2 of the sparse files to actual disk, I 
 should have enough storage for about 2 TB of data without risking running out 
 of space on the sparse file drive.  I know it'll be quirky and I'd need to 
 monitor the sparse file drive closely to insure it doesn't run out of room 
 (or risk unexpected results, possibly complete data loss depending on how ZFS 
 deals with that kind of problem).

It doesn't work exactly how you describe it. ZFS cannot report back to the file
that a given block is free. Because of the COW model, if you modify your
pool a lot, blocks will be allocated in the sparse files but never
released, so your sparse files will only grow. You can end up with a mostly
empty pool and fully populated sparse files.

As for the idea itself, I did something similar in the past when I was
changing my pool layout: I created a raidz2 vdev with two sparse files, which I
removed immediately, and used the two disks I saved as temporary storage. Once I
had copied the data to the raidz2 destination pool, I added those two disks into
the holes and let the ZFS resilver do its job.
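
A sketch of that trick, with made-up device names, file paths and sizes (zpool may ask for -f when mixing disks and files; a raidz2 vdev keeps working with two members missing, which is what makes this possible):

# truncate -s 1500G /stage/sparse1 /stage/sparse2
# zpool create newpool raidz2 da1 da2 da3 da4 /stage/sparse1 /stage/sparse2
# zpool offline newpool /stage/sparse1
# zpool offline newpool /stage/sparse2
(copy the data into newpool, then reuse the two saved disks)
# zpool replace newpool /stage/sparse1 da5
# zpool replace newpool /stage/sparse2 da6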

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] feature proposal

2009-07-29 Thread Pawel Jakub Dawidek
On Wed, Jul 29, 2009 at 05:34:53PM -0700, Roman V Shaposhnik wrote:
 On Wed, 2009-07-29 at 15:06 +0300, Andriy Gapon wrote:
  What do you think about the following feature?
  
  Subdirectory is automatically a new filesystem property - an 
  administrator turns
  on this magic property of a filesystem, after that every mkdir *in the 
  root* of
  that filesystem creates a new filesystem. The new filesystems have
  default/inherited properties except for the magic property which is off.
  
  Right now I see this as being mostly useful for /home. Main benefit in this 
  case
  is that various user administration tools can work unmodified and do the 
  right
  thing when an administrator wants a policy of a separate fs per user
  But I am sure that there could be other interesting uses for this.
 
 This feature request touches upon a very generic observation that my
 group made a long time ago: ZFS is a wonderful filesystem, the only
 trouble is that (almost) all the cool features have to be asked for
 using non-filesystem (POSIX) APIs. Basically everytime you have
 to do anything with ZFS you have to do it on a host where ZFS runs.
 
 The sole exception from this rule is .zfs subdirectory that lets you
 have access to snapshots without explicit calls to zfs(1M). 
 
 Basically .zfs subdirectory is your POSIX FS way to request two bits
 of ZFS functionality. In general, however, we all want more.
 
 On the read-only front: wouldn't it be cool to *not* run zfs sends 
 explicitly but have:
 .zfs/send/<snap-name>
 .zfs/sendr/<from-snap-name>-<to-snap-name>
 give you the same data automagically? 
 
 On the read-write front: wouldn't it be cool to be able to snapshot
 things by:
 $ mkdir .zfs/snapshot/snap-name
 ?

Are you sure this doesn't work on Solaris/OpenSolaris? From looking at
the code, you should be able to do exactly that, as well as destroy a
snapshot by rmdir'ing such an entry.
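
In other words, something along these lines should already work there; the dataset and snapshot names are made up for illustration:

# mkdir /tank/home/.zfs/snapshot/today
# zfs list -t snapshot -r tank/home
# rmdir /tank/home/.zfs/snapshot/today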

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
p...@freebsd.org   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] cleaning user properties

2008-11-03 Thread Pawel Jakub Dawidek
On Mon, Nov 03, 2008 at 11:47:19AM +0100, Luca Morettoni wrote:
 I have a little question about user properties, I have two filesystems:
 
 rpool/export/home/luca
 and
 rpool/export/home/luca/src
 
 in these two I have one user property, set with:
 
 zfs set net.morettoni:test=xyz rpool/export/home/luca
 zfs set net.morettoni:test=123 rpool/export/home/luca/src
 
 now I need to *clear* (remove) the property from 
 rpool/export/home/luca/src filesystem, but if I use the inherit 
 command I'll get the parent property, any hint to delete it?

You can't delete it; that's just how things work. I work around it by
treating an empty property and a missing property the same.
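
That is, instead of trying to remove the property, set it to an empty value and have your tools treat that the same as 'unset' (assuming your zfs version accepts an empty user-property value):

# zfs set net.morettoni:test="" rpool/export/home/luca/src
# zfs get net.morettoni:test rpool/export/home/luca/src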

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] space_map.c 'ss == NULL' panic strikes back.

2007-11-14 Thread Pawel Jakub Dawidek
Hi.

Someone recently reported an 'ss == NULL' panic in
space_map.c/space_map_add() on FreeBSD's version of ZFS.

I found that this problem was previously reported on Solaris and is
already fixed. I verified it, and FreeBSD's version has this fix in
place...


http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/common/fs/zfs/space_map.c?r2=3761&r1=3713

I'd really like to help this guy get his data back, so please point me
into right direction. We have a crash dump of the panic, BTW.

It happened after a spontaneous reboot. Now, the system panics on
'zpool import' immediately.

He already tried two things:

1. Importing the pool with 'zpool import -o ro backup'. No luck, it
   crashes.

2. Importing the pool without mounting file systems (I sent him a patch
   to zpool, to not mount file systems automatically on pool import).
   I hoped that maybe only one or more file systems were corrupted, but
   no, it panics immediately as well.

It's the biggest storage machine they have, so there is no way to back up the
raw disks before starting more experiments; that's why I'm writing
here. I have two ideas:

1. Because it happened on a system crash or something similar, we can expect
   that the problem is caused by the last change. If so, we could try corrupting
   the most recent uberblock, so ZFS will pick up the previous uberblock.

2. Instead of panicking in space_map_add(), we could try to
   space_map_remove() the offending entry, e.g.:

-   VERIFY(ss == NULL);
+   if (ss != NULL) {
+   space_map_remove(sm, ss->ss_start, ss->ss_end);
+   goto again;
+   }

Both of those ideas could make things worse, so I want to know what damage
can be done using those methods, or even better, what else (safer) we can
try.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] About bug 6486493 (ZFS boot incompatible with

2007-10-08 Thread Pawel Jakub Dawidek
On Fri, Oct 05, 2007 at 08:52:17AM +0100, Robert Milkowski wrote:
 Hello Eric,
 
 Thursday, October 4, 2007, 5:54:06 PM, you wrote:
 
 ES On Thu, Oct 04, 2007 at 05:22:58AM -0700, Ivan Wang wrote:
   This bug was rendered moot via 6528732 in build
   snv_68 (and s10_u5).  We
   now store physical devices paths with the vnodes, so
   even though the
   SATA framework doesn't correctly support open by
   devid in early boot, we
  
  But if I read it right, there is still a problem in SATA framework 
  (failing ldi_open_by_devid,) right?
  If this problem is framework-wide, it might just bite back some time in 
  the future.
  
 
 ES Yes, there is still a bug in the SATA framework, in that
 ES ldi_open_by_devid() doesn't work early in boot.  Opening by device path
 ES works so long as you don't recable your boot devices.  If we had open by
 ES devid working in early boot, then this wouldn't be a problem.
 
 Even if someone re-cables sata disks couldn't we fallback to read zfs
 label from all available disks and find our pool and import it?

FreeBSD's GEOM storage framework implements a method called 'taste'.
When a new disk arrives (or is closed after the last write), GEOM calls the
taste methods of all storage subsystems, and each subsystem can try to read its
metadata. This is basically how autoconfiguration happens in FreeBSD for
things like software RAID1/RAID3/stripe and others.
It's much simpler than what ZFS does:
1. read /etc/zfs/zpool.cache
2. open components by name
3. if there is no such disk, go to 5
4. verify the diskid (not all disks have an ID)
5. if the diskid doesn't match, try to look the device up by ID

If there are a few hundred disks, it may slow booting down, but it has
never been a real problem in FreeBSD.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] replacing a device with itself doesn't work

2007-10-08 Thread Pawel Jakub Dawidek
On Wed, Oct 03, 2007 at 10:02:03PM +0200, Pawel Jakub Dawidek wrote:
 On Wed, Oct 03, 2007 at 12:10:19PM -0700, Richard Elling wrote:
   -
   
   # zpool scrub tank
   # zpool status -v tank
 pool: tank
state: ONLINE
   status: One or more devices could not be used because the label is 
   missing or
   invalid.  Sufficient replicas exist for the pool to continue
   functioning in a degraded state.
   action: Replace the device using 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-4J
scrub: resilver completed with 0 errors on Wed Oct  3 18:45:06 2007
   config:
   
   NAME        STATE     READ WRITE CKSUM
   tank        ONLINE       0     0     0
     raidz1    ONLINE       0     0     0
       md0     UNAVAIL      0     0     0  corrupted data
       md1     ONLINE       0     0     0
       md2     ONLINE       0     0     0
   
   errors: No known data errors
   # zpool replace tank md0
   invalid vdev specification
   use '-f' to override the following errors:
   md0 is in use (r1w1e1)
   # zpool replace -f tank md0
   invalid vdev specification
   the following errors must be manually repaired:
   md0 is in use (r1w1e1)
   
   -
   Well the advice of 'zpool replace' doesn't work. At this point the user 
   is now stuck. There seems to
   be just no way to now use the existing device md0.
  
  In Solaris NV b72, this works as you expect.
  # zpool replace zwimming /dev/ramdisk/rd1
  # zpool status -v zwimming
 pool: zwimming
state: DEGRADED
scrub: resilver completed with 0 errors on Wed Oct  3 11:55:36 2007
  config:
  
   NAME                        STATE     READ WRITE CKSUM
   zwimming                    DEGRADED     0     0     0
     raidz1                    DEGRADED     0     0     0
       replacing               DEGRADED     0     0     0
         /dev/ramdisk/rd1/old  FAULTED      0     0     0  corrupted data
         /dev/ramdisk/rd1      ONLINE       0     0     0
       /dev/ramdisk/rd2        ONLINE       0     0     0
       /dev/ramdisk/rd3        ONLINE       0     0     0
  
  errors: No known data errors
  # zpool status -v zwimming
 pool: zwimming
state: ONLINE
scrub: resilver completed with 0 errors on Wed Oct  3 11:55:36 2007
  config:
  
   NAME  STATE READ WRITE CKSUM
   zwimming  ONLINE   0 0 0
 raidz1  ONLINE   0 0 0
   /dev/ramdisk/rd1  ONLINE   0 0 0
   /dev/ramdisk/rd2  ONLINE   0 0 0
   /dev/ramdisk/rd3  ONLINE   0 0 0
  
  errors: No known data errors
 
 Good to know, but I think it's still a bit of a ZFS fault. The error
 message 'md0 is in use (r1w1e1)' means that something (I'm quite sure
 it's ZFS) keeps the device open. Why does it keep it open when it doesn't
 recognize it? Or maybe it tries to open it twice for write (exclusively)
 when replacing, which is not allowed by GEOM on FreeBSD.
 
 I can take a look at whether this is the former or the latter, but it should be
 fixed in ZFS itself, IMHO.

Ok, it seems that it was fixed in ZFS itself already:

/*
 * If we are setting the vdev state to anything but an open state, then
 * always close the underlying device.  Otherwise, we keep accessible
 * but invalid devices open forever.  We don't call vdev_close() itself,
 * because that implies some extra checks (offline, etc) that we don't
 * want here.  This is limited to leaf devices, because otherwise
 * closing the device will affect other children.
 */
        if (vdev_is_dead(vd) && vd->vdev_ops->vdev_op_leaf)
                vd->vdev_ops->vdev_op_close(vd);

The ZFS version in FreeBSD-CURRENT doesn't have this code yet; it's only in
my Perforce branch for now. I'll verify later today whether it really fixes the
problem and report back if it doesn't.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] replacing a device with itself doesn't work

2007-10-03 Thread Pawel Jakub Dawidek
On Wed, Oct 03, 2007 at 12:10:19PM -0700, Richard Elling wrote:
  -
  
  # zpool scrub tank
  # zpool status -v tank
pool: tank
   state: ONLINE
  status: One or more devices could not be used because the label is 
  missing or
  invalid.  Sufficient replicas exist for the pool to continue
  functioning in a degraded state.
  action: Replace the device using 'zpool replace'.
 see: http://www.sun.com/msg/ZFS-8000-4J
   scrub: resilver completed with 0 errors on Wed Oct  3 18:45:06 2007
  config:
  
  NAME        STATE     READ WRITE CKSUM
  tank        ONLINE       0     0     0
    raidz1    ONLINE       0     0     0
      md0     UNAVAIL      0     0     0  corrupted data
      md1     ONLINE       0     0     0
      md2     ONLINE       0     0     0
  
  errors: No known data errors
  # zpool replace tank md0
  invalid vdev specification
  use '-f' to override the following errors:
  md0 is in use (r1w1e1)
  # zpool replace -f tank md0
  invalid vdev specification
  the following errors must be manually repaired:
  md0 is in use (r1w1e1)
  
  -
  Well the advice of 'zpool replace' doesn't work. At this point the user 
  is now stuck. There seems to
  be just no way to now use the existing device md0.
 
 In Solaris NV b72, this works as you expect.
 # zpool replace zwimming /dev/ramdisk/rd1
 # zpool status -v zwimming
pool: zwimming
   state: DEGRADED
   scrub: resilver completed with 0 errors on Wed Oct  3 11:55:36 2007
 config:
 
  NAME                        STATE     READ WRITE CKSUM
  zwimming                    DEGRADED     0     0     0
    raidz1                    DEGRADED     0     0     0
      replacing               DEGRADED     0     0     0
        /dev/ramdisk/rd1/old  FAULTED      0     0     0  corrupted data
        /dev/ramdisk/rd1      ONLINE       0     0     0
      /dev/ramdisk/rd2        ONLINE       0     0     0
      /dev/ramdisk/rd3        ONLINE       0     0     0
 
 errors: No known data errors
 # zpool status -v zwimming
pool: zwimming
   state: ONLINE
   scrub: resilver completed with 0 errors on Wed Oct  3 11:55:36 2007
 config:
 
  NAME  STATE READ WRITE CKSUM
  zwimming  ONLINE   0 0 0
raidz1  ONLINE   0 0 0
  /dev/ramdisk/rd1  ONLINE   0 0 0
  /dev/ramdisk/rd2  ONLINE   0 0 0
  /dev/ramdisk/rd3  ONLINE   0 0 0
 
 errors: No known data errors

Good to know, but I think it's still a bit of ZFS fault. The error
message 'md0 is in use (r1w1e1)' means that something (I'm quite sure
it's ZFS) keeps device open. Why does it keeps it open when it doesn't
recognize it? Or maybe it tries to open it twice for write (exclusively)
when replacing, which is not allowed in GEOM in FreeBSD.

I can take a look at whether this is the former or the latter, but it should be
fixed in ZFS itself, IMHO.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS (and quota)

2007-10-01 Thread Pawel Jakub Dawidek
On Mon, Oct 01, 2007 at 12:57:05PM +0100, Robert Milkowski wrote:
 Hello Neil,
 
 Thursday, September 27, 2007, 11:40:42 PM, you wrote:
 
 
 NP Roch - PAE wrote:
  Pawel Jakub Dawidek writes:
I'm CCing zfs-discuss@opensolaris.org, as this doesn't look like
FreeBSD-specific problem.

It looks there is a problem with block allocation(?) when we are near
quota limit. tank/foo dataset has quota set to 10m:

Without quota:

   FreeBSD:
   # dd if=/dev/zero of=/tank/test bs=512 count=20480
   time: 0.7s

   Solaris:
   # dd if=/dev/zero of=/tank/test bs=512 count=20480
   time: 4.5s

With quota:

   FreeBSD:
   # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480
   dd: /tank/foo/test: Disc quota exceeded
   time: 306.5s

   Solaris:
   # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480
   write: Disc quota exceeded
   time: 602.7s

CPU is almost entirely idle, but disk activity seems to be high.

  
  
  Yes, as we are near quota limit, each transaction group
  will accept a small amount as to not overshoot the limit.
  
  I don't know if we have the optimal strategy yet.
  
  -r
 
 NP Aside from the quota perf issue, has any analysis been done as to
 NP why FreeBSD is over 6X faster than Solaris without quotas?
 NP Do other perf tests show a similar disparity?
 NP Is there a difference in dd itself?
 NP I assume that it was identical hardware and pool config.

(I don't see this e-mail in my ZFS inbox, which is why I'm replying to
Robert's e-mail.)

Just to clarify: this was entirely different hardware. My e-mail was
__only__ about quota performance in ZFS. Please do not try to use those
results for any other purpose.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS (and quota)

2007-09-21 Thread Pawel Jakub Dawidek
I'm CCing zfs-discuss@opensolaris.org, as this doesn't look like a
FreeBSD-specific problem.

It looks like there is a problem with block allocation(?) when we are near
the quota limit. The tank/foo dataset has a quota set to 10m:

Without quota:

FreeBSD:
# dd if=/dev/zero of=/tank/test bs=512 count=20480
time: 0.7s

Solaris:
# dd if=/dev/zero of=/tank/test bs=512 count=20480
time: 4.5s

With quota:

FreeBSD:
# dd if=/dev/zero of=/tank/foo/test bs=512 count=20480
dd: /tank/foo/test: Disc quota exceeded
time: 306.5s

Solaris:
# dd if=/dev/zero of=/tank/foo/test bs=512 count=20480
write: Disc quota exceeded
time: 602.7s

CPU is almost entirely idle, but disk activity seems to be high.

Any ideas?

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] The ZFS-Man.

2007-09-21 Thread Pawel Jakub Dawidek
Hi.

I gave a talk about ZFS during EuroBSDCon 2007, and because it won the
best talk award and some found it funny, here it is:

http://youtube.com/watch?v=o3TGM0T1CvE

a bit better version is here:

http://people.freebsd.org/~pjd/misc/zfs/zfs-man.swf

BTW, inspired by the ZFS demos from the OpenSolaris page, I created a few
demos of ZFS on FreeBSD:

http://youtube.com/results?search_query=freebsd+zfs&search=Search

And better versions:

http://people.freebsd.org/~pjd/misc/zfs/

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS Evil Tuning Guide

2007-09-17 Thread Pawel Jakub Dawidek
On Mon, Sep 17, 2007 at 03:40:05PM +0200, Roch - PAE wrote:
 
 Tuning should not be done in general and Best practices
 should be followed.
 
 So get very much acquainted with this first :
 
   http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
 
 Then if you must, this could soothe or sting : 
 
   http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
 
 So drive carefully.

If some LUNs exposed to ZFS are not protected by NVRAM, then this
tuning can lead to data loss or application level corruption.  However
the ZFS pool integrity itself is NOT compromised by this tuning.

Are you sure? Once you turn off cache flushing, how can you tell that
your disk didn't reorder writes so that the uberblock was updated before the
new blocks were written? Will ZFS go to the previous uberblocks when the newest
uberblock points at corrupted data?

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-13 Thread Pawel Jakub Dawidek
On Thu, Sep 13, 2007 at 04:58:10AM +, Marc Bevand wrote:
 Pawel Jakub Dawidek pjd at FreeBSD.org writes:
  
  This is how RAIDZ fills the disks (follow the numbers):
  
  Disk0   Disk1   Disk2   Disk3
  
  D0  D1  D2  P3
  D4  D5  D6  P7
  D8  D9  D10 P11
  D12 D13 D14 P15
  D16 D17 D18 P19
  D20 D21 D22 P23
  
  D is data, P is parity.
 
 This layout assumes of course that large stripes have been written to
 the RAIDZ vdev. As you know, the stripe width is dynamic, so it is
 possible for a single logical block to span only 2 disks (for those who
 don't know what I am talking about, see the red block occupying LBAs
 D3 and E3 on page 13 of these ZFS slides [1]).

Yes I'm aware of that.

 To read this logical block (and validate its checksum), only D_0 needs 
 to be read (LBA E3). So in this very specific case, a RAIDZ read
 operation is as cheap as a RAID5 read operation. [...]

If you do single-sector writes - yes, but this is very inefficient,
for two reasons:
1. Bandwidth - writing one sector at a time? Come on.
2. Space - when you write one sector and its parity you consume two
   sectors. You may have more than one parity column in that case, e.g.
	Disk0   Disk1   Disk2   Disk3   Disk4   Disk5
	D0      P0      D1      P1      D2      P2
   In this case the space overhead is the same as in a mirror.

 [...] The existence of these
 small stripes could explain why RAIDZ doesn't perform as bad as RAID5
 in Pawel's benchmark...

No, as I said, the smallest block I used was 2kB, which means four 512-byte
blocks plus one 512-byte parity block - each 2kB block uses all 5 disks.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Pawel Jakub Dawidek
On Wed, Sep 12, 2007 at 02:24:56PM -0700, Adam Leventhal wrote:
 On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote:
  And here are the results:
  
  RAIDZ:
  
  Number of READ requests: 4.
  Number of WRITE requests: 0.
  Number of bytes to transmit: 695678976.
  Number of processes: 8.
  Bytes per second: 1305213
  Requests per second: 75
  
  RAID5:
  
  Number of READ requests: 4.
  Number of WRITE requests: 0.
  Number of bytes to transmit: 695678976.
  Number of processes: 8.
  Bytes per second: 2749719
  Requests per second: 158
 
 I'm a bit surprised by these results. Assuming relatively large blocks
 written, RAID-Z and RAID-5 should be laid out on disk very similarly
 resulting in similar read performance.

Hmm, no. The data was organized very differently on the disks. The smallest
block size used was 2kB, to ensure each block is written to all disks in
the RAIDZ configuration. In the RAID5 configuration, however, a 128kB stripe
size was used, which means each block was stored on one disk only.

Now when you read the data, RAIDZ needs to read all disks for each block,
while RAID5 needs to read only one disk for each block.

 Did you compare the I/O characteristic of both? Was the bottleneck in
 the software or the hardware?

The bottleneck was definitely the disks; the CPU was about 96% idle.

To be honest, just like Jeff, I expected a much bigger win in the RAID5 case.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Pawel Jakub Dawidek
On Wed, Sep 12, 2007 at 11:20:52PM +0100, Peter Tribble wrote:
 On 9/10/07, Pawel Jakub Dawidek [EMAIL PROTECTED] wrote:
  Hi.
 
  I've a prototype RAID5 implementation for ZFS. It only works in
  non-degraded state for now. The idea is to compare RAIDZ vs. RAID5
  performance, as I suspected that RAIDZ, because of full-stripe
  operations, doesn't work well for random reads issued by many processes
  in parallel.
 
  There is of course write-hole problem, which can be mitigated by running
  scrub after a power failure or system crash.
 
 If I read your suggestion correctly, your implementation is much
 more like traditional raid-5, with a read-modify-write cycle?
 
 My understanding of the raid-z performance issue is that it requires
 full-stripe reads in order to validate the checksum. [...]

No, the checksum is an independent thing, and it is not the reason why RAIDZ
needs to do full-stripe reads - in non-degraded mode RAIDZ doesn't read
parity.

This is how RAIDZ fills the disks (follow the numbers):

Disk0   Disk1   Disk2   Disk3

D0  D1  D2  P3
D4  D5  D6  P7
D8  D9  D10 P11
D12 D13 D14 P15
D16 D17 D18 P19
D20 D21 D22 P23

D is data, P is parity.

And RAID5 does this:

Disk0   Disk1   Disk2   Disk3

D0  D3  D6  P0,3,6
D1  D4  D7  P1,4,7
D2  D5  D8  P2,5,8
D9  D12 D15 P9,12,15
D10 D13 D16 P10,13,16
D11 D14 D17 P11,14,17

As you can see, even a small block is stored on all disks in RAIDZ, whereas
in RAID5 a small block can be stored on one disk only.
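
To put a number on it, here is a toy C sketch (not ZFS code; it assumes the
simplified, aligned 4+1 geometry from the tables above) that counts how many
disks a single logical-block read has to touch in the non-degraded case:

/*
 * Toy model (not ZFS code): how many disks must be read to return one
 * logical block in the non-degraded case, for the two layouts above.
 * Assumptions: 512-byte sectors, aligned blocks, parity is not read.
 */
#include <stdio.h>

#define SECTOR      512
#define DISKS       5                   /* 4 data columns + 1 parity column */
#define RAID5_UNIT  (128 * 1024)        /* RAID5 per-disk stripe unit */

/* RAIDZ: a block is spread over all data columns of its (dynamic) stripe. */
static int raidz_disks_read(size_t blksz)
{
	size_t sectors = (blksz + SECTOR - 1) / SECTOR;
	size_t datacols = DISKS - 1;
	return (int)(sectors < datacols ? sectors : datacols);
}

/* RAID5: a block stays on one disk until it crosses a stripe unit. */
static int raid5_disks_read(size_t blksz)
{
	size_t units = (blksz + RAID5_UNIT - 1) / RAID5_UNIT;
	size_t datacols = DISKS - 1;
	return (int)(units < datacols ? units : datacols);
}

int main(void)
{
	size_t sizes[] = { 2048, 16 * 1024, 128 * 1024 };
	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%6zu-byte block: RAIDZ reads %d disk(s), RAID5 reads %d disk(s)\n",
		    sizes[i], raidz_disks_read(sizes[i]), raid5_disks_read(sizes[i]));
	return (0);
}

With many processes issuing small random reads in parallel, RAID5 can keep the
disks busy with independent requests, while RAIDZ occupies every data disk for
each single request - which is the effect the benchmark was built to show.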

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-12 Thread Pawel Jakub Dawidek
On Wed, Sep 12, 2007 at 07:39:56PM -0500, Al Hopper wrote:
 This is how RAIDZ fills the disks (follow the numbers):
 
  Disk0   Disk1   Disk2   Disk3
 
  D0  D1  D2  P3
  D4  D5  D6  P7
  D8  D9  D10 P11
  D12 D13 D14 P15
  D16 D17 D18 P19
  D20 D21 D22 P23
 
 D is data, P is parity.
 
 And RAID5 does this:
 
  Disk0   Disk1   Disk2   Disk3
 
  D0  D3  D6  P0,3,6
  D1  D4  D7  P1,4,7
  D2  D5  D8  P2,5,8
  D9  D12 D15 P9,12,15
  D10 D13 D16 P10,13,16
  D11 D14 D17 P11,14,17
 
 Surely the above is not accurate?  You've showing the parity data only 
 being written to disk3.  In RAID5 the parity is distributed across all 
 disks in the RAID5 set.  What is illustrated above is RAID3.

It's actually RAID4 (RAID3 would look the same as RAIDZ, but there are
differences in practice), but my point wasn't how the parity is
distributed. :) OK, RAID5 once again:

Disk0   Disk1   Disk2   Disk3

D0  D3  D6  P0,3,6
D1  D4  D7  P1,4,7
D2  D5  D8  P2,5,8
D9  D12 P9,12,15D15
D10 D13 P10,13,16   D16
D11 D14 P11,14,17   D17

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-11 Thread Pawel Jakub Dawidek
On Tue, Sep 11, 2007 at 08:16:02AM +0100, Robert Milkowski wrote:
 Are you overwriting old data? I hope you're not...

I am - I overwrite parity; that is the whole point. That's why the ZFS
designers used RAIDZ instead of RAID5, I think.

 I don't think you should suffer from above problem in ZFS due to COW.

I do, because independent blocks share the same parity block.

 If you are not overwriting and you're just writing to new locations
 from the pool perspective those changes (both new data block and
 checksum block) won't be active until they are both flushed and uber
 block is updated... right?

Assume a 128kB stripe size in RAID5. You have three disks: A, B and C.
ZFS writes 128kB at offset 0. This makes RAID5 write the data to disk A
and the parity to disk C (both at offset 0). Then ZFS writes 128kB at
offset 128kB. RAID5 writes the data to disk B (at offset 0) and updates
the parity on disk C (also at offset 0).

As you can see, two independent ZFS blocks share one parity block.
COW won't help you here; you would need to be sure that each ZFS
transaction goes to a different (and free) RAID5 row.

This is, I believe, the main reason why plain RAID5 wasn't used in the first
place.
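
To illustrate the shared-parity problem, here is a toy C program (not RAID or
ZFS code - single bytes stand in for whole blocks) showing what that crash
window does to an unrelated block in the same row:

/*
 * Toy illustration: two independent logical blocks, A and B, share the
 * parity P = A ^ B on a third disk.  If B is rewritten but the machine
 * dies before P is updated, reconstructing A from B and P after losing
 * disk A returns garbage - the classic RAID5 write hole.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t A = 0x11, B = 0x22;     /* two unrelated logical blocks */
	uint8_t P = A ^ B;              /* the parity sector they share */

	/* Crash window: new data for B hits the disk, parity update is lost. */
	B = 0x99;
	/* P = A ^ B;   <- this write never happened */

	/* Later disk A dies and we reconstruct it from B and P. */
	uint8_t reconstructed_A = B ^ P;

	printf("original A      = 0x%02x\n", A);
	printf("reconstructed A = 0x%02x (%s)\n", reconstructed_A,
	    reconstructed_A == A ? "ok" : "corrupted by the write hole");
	return (0);
}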

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-10 Thread Pawel Jakub Dawidek
On Mon, Sep 10, 2007 at 04:31:32PM +0100, Robert Milkowski wrote:
 Hello Pawel,
 
 Excellent job!
 
 Now I guess it would be a good idea to get writes done properly,
 even if it means make them slow (like with SVM). The end result
 would be - do you want fast wrties/slow reads go ahead with
 raid-z; if you need fast reads/slow writes go with raid-5.

Writes in non-degraded mode already work. Only degraded mode
doesn't work yet. My implementation is based on RAIDZ, so I'm planning to
support RAID6 as well.

 btw: I'm just thinking loudly - for raid-5 writes, couldn't you
 somewhow utilize ZIL to make writes safe? I'm asking because we've
 got an ability to put zil somewhere else like NVRAM card...

The problem with RAID5 is that different blocks share the same parity,
which is not the case for RAIDZ. When you write a block in RAIDZ, you
write the data and the parity, and then you switch the pointer in the
uberblock. For RAID5, you write the data and you need to update the parity,
which also protects some other data. Now if you write the data but
don't update the parity before a crash, you have a hole. If you update
the parity before the write and then crash, you have an inconsistency with
a different block in the same stripe.

My idea was to have one sector every 1GB on each disk for a journal to
keep a list of blocks being updated. For example, you want to write 2kB of
data at offset 1MB. You first store offset+size in this journal, then
write the data and update the parity, and then remove offset+size from the
journal. Unfortunately, we would need to flush the write cache twice: after
the offset+size addition and before the offset+size removal.
We could optimize it by doing lazy removal, e.g. wait for ZFS to flush the
write cache as part of a transaction and then remove the old offset+size
pairs.
But I still expect this to add too much overhead.
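
A rough sketch of that journal idea, with invented names and no real I/O,
just to show where the two cache flushes would land:

/* Hypothetical sketch of the per-disk intent journal idea (not real code). */
#include <stdio.h>
#include <stdint.h>

struct intent {
	uint64_t offset;        /* start of the region being rewritten */
	uint64_t size;          /* length of that region */
	int      valid;         /* non-zero while the update is in flight */
};

static struct intent journal;   /* one reserved sector per 1GB in the idea */

static void flush_write_cache(void) { /* would issue BIO_FLUSH here */ }

static void raid5_write(uint64_t off, uint64_t size)
{
	/* 1. Remember which row(s) will have data and parity out of sync. */
	journal.offset = off;
	journal.size = size;
	journal.valid = 1;
	flush_write_cache();            /* first mandatory flush */

	/* 2. Write the data and update the shared parity block here. */

	/* 3. Forget the entry (may be done lazily with the next txg flush). */
	journal.valid = 0;
	flush_write_cache();            /* the second, expensive flush */
}

int main(void)
{
	raid5_write(1ULL << 20, 2048);  /* the 2kB-at-1MB example from above */
	printf("journal entry cleared: %s\n", journal.valid ? "no" : "yes");
	return (0);
}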

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS Bad Blocks Handling

2007-08-28 Thread Pawel Jakub Dawidek
On Mon, Aug 27, 2007 at 10:00:10PM -0700, RL wrote:
 Hi,
 
 Does ZFS flag blocks as bad so it knows to avoid using them in the future?

No, it doesn't. This would be a really nice feature to have, but
currently when ZFS tries to write to a bad sector it simply retries a few
times and gives up. With the COW model it shouldn't be very hard to try
another block and mark this one as bad, but it's not implemented yet.

 During testing I had huge numbers of unrecoverable checksum errors, which I 
 resolved by disabling write caching on the disks.
 
 After doing this, and confirming the errors had stopped occuring, I removed 
 the test files. A few seconds after removing the test files, I noticed the 
 used space dropped from 16GB to 11GB according to 'df', but it did not appear 
 to ever drop below this value.
 
 Is this just normal file system overhead (This is a raidz with 8x 500GB 
 drives), or has ZFS not freed some of the space allocated to bad files?

Can you retry your test without the write cache, starting by recreating the
pool?

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] New version of the ZFS test suite released

2007-08-04 Thread Pawel Jakub Dawidek
On Fri, Aug 03, 2007 at 10:56:53PM -0700, Jim Walker wrote:
 Version 1.8 of the ZFS test suite was released today on opensolaris.org.
 
 The ZFS test suite source tarballs, packages and baseline can be
 downloaded at:
 http://dlc.sun.com/osol/test/downloads/current/
 
 The ZFS test suite source can be browsed at:
 http://src.opensolaris.org/source/xref/test/ontest-stc2/src/suites/zfs/  
 
 More information on the ZFS test suite is at:
 http://opensolaris.org/os/community/zfs/zfstestsuite/
 
 Questions about the ZFS test suite can be sent to zfs-discuss at:
 http://www.opensolaris.org/jive/forum.jspa?forumID=80

Is it in a Mercurial repository? I'm not able to download it, but maybe
I'm using the wrong path:

% hg clone ssh://[EMAIL PROTECTED]/hg/test/ontest-stc2 test
remote: Repository 'hg/test/ontest-stc2' inaccessible: No such file or 
directory.
abort: no suitable response from remote hg!

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] zpool import minor bug in snv_64a

2007-06-25 Thread Pawel Jakub Dawidek
On Mon, Jun 25, 2007 at 02:34:21AM -0400, Dennis Clarke wrote:
 
  in /usr/src/cmd/zpool/zpool_main.c :
 
 
 at line 680 forwards we can probably check for this scenario :
 
 if ( ( altroot != NULL ) && ( altroot[0] != '/') ) {
 (void) fprintf(stderr, gettext(invalid alternate root '%s': 
 must be an absolute path\n), altroot);
 nvlist_free(nvroot);
 return (1);
 }
 
 /*  some altroot has been specified  *
  *  thus altroot[0] and altroot[1] exist */
 
 else if ( ( altroot[0] = '/') && ( altroot[1] = '\0') ) {

s/=/==/

 (void) fprintf(stderr, Do not specify / as alternate root.\n);

You need gettext() here.

 nvlist_free(nvroot);
 return (1);
 }
 
 
 not perfect .. but something along those lines.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Re: ZFS Scalability/performance

2007-06-24 Thread Pawel Jakub Dawidek
On Sat, Jun 23, 2007 at 10:21:14PM -0700, Anton B. Rang wrote:
  Oliver Schinagl wrote:
   zo basically, what you are saying is that on FBSD there's no performane
   issue, whereas on solaris there (can be if write caches aren't enabled)
  
  Solaris plays it safe by default.  You can, of course, override that safety.
 
 FreeBSD plays it safe too.  It's just that UFS, and other file systems on 
 FreeBSD, understand write caches and flush at appropriate times.

That's not true. None of the file systems in FreeBSD understand and flush
the disk write cache except for ZFS and UFS+gjournal.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS Scalability/performance

2007-06-20 Thread Pawel Jakub Dawidek
On Wed, Jun 20, 2007 at 01:45:29PM +0200, Oliver Schinagl wrote:
 
 
 Pawel Jakub Dawidek wrote:
  On Tue, Jun 19, 2007 at 07:52:28PM -0700, Richard Elling wrote:

  On that note, i have a different first question to start with. I
  personally am a Linux fanboy, and would love to see/use ZFS on linux. I
  assume that I can use those ZFS disks later with any os that can
  work/recognizes ZFS correct? e.g.  I can install/setup ZFS in FBSD, and
  later use it in OpenSolaris/Linux Fuse(native) later?

  The on-disk format is an available specification and is designed to be
  platform neutral.  We certainly hope you will be able to access the
  zpools from different OSes (one at a time).
  
 
  Will be nice to not EFI label disks, though:) Currently there is a
  problem with this - zpool created on Solaris is not recognized by
  FreeBSD, because FreeBSD claims GPT label is corrupted. On the other
  hand, creating ZFS on FreeBSD (on a raw disk) can be used under Solaris.
 

 
 I read this earlier, that it's recommended to use a whole disk instead
 of a partition with zfs, the thing that's holding me back however is the
 mixture of different sized disks I have. I suppose if I had a 300gb per
 disk raid-z going on 3 300 disk and one 320gb disk, but only have a
 partition of 300gb on it (still with me), i could later expand that
 partition with fdisk and the entire raid-z would then expand to 320gb
 per disk (assuming the other disks magically gain 20gb, so this is a bad
 example in that sense :) )
 
 Also what about full disk vs full partition, e.g. make 1 partition to
 span the entire disk vs using the entire disk.
 Is there any significant performance penalty? (So not having a disk
 split into 2 partitions, but 1 disk, 1 partition) I read that with a
 full raw disk zfs will be beter to utilize the disks write cache, but I
 don't see how.

On FreeBSD (thanks to GEOM) it makes no difference what you have
under ZFS. On Solaris, ZFS turns on the disk's write cache when a whole disk
is used. On FreeBSD the write cache is enabled by default, and GEOM consumers
can send write-cache-flush (BIO_FLUSH) requests to any GEOM provider.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS Scalability/performance

2007-06-20 Thread Pawel Jakub Dawidek
On Wed, Jun 20, 2007 at 09:48:08AM -0700, Eric Schrock wrote:
 On Wed, Jun 20, 2007 at 12:45:52PM +0200, Pawel Jakub Dawidek wrote:
  
  Will be nice to not EFI label disks, though:) Currently there is a
  problem with this - zpool created on Solaris is not recognized by
  FreeBSD, because FreeBSD claims GPT label is corrupted. On the other
  hand, creating ZFS on FreeBSD (on a raw disk) can be used under Solaris.
  
 
 FYI, the primary reason for using EFI labels is that they are
 endian-neutral, unlike Solaris VTOC.  The secondary reason is that they
 are simpler and easier to use (at least on Solaris).
 
 I'm curious why FreeBSD claims the GPT label is corrupted.  Is this
 because FreeBSD doesn't understand EFI labels, our EFI label is bad, or
 is there a bug in the FreeBSD EFI implementation?

I haven't investigated this yet. FreeBSD should understand EFI, so it's
either one of the last two or a bug in the Solaris EFI implementation. :) I
seem to recall similar problems on Linux with ZFS/FUSE...

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Preparing to compare Solaris/ZFS and FreeBSD/ZFS performance.

2007-05-24 Thread Pawel Jakub Dawidek
On Thu, May 24, 2007 at 11:20:44AM +0100, Darren J Moffat wrote:
 Pawel Jakub Dawidek wrote:
 Hi.
 I'm all set for doing performance comparsion between Solaris/ZFS and
 FreeBSD/ZFS. I spend last few weeks on FreeBSD/ZFS optimizations and I
 think I'm ready. The machine is 1xQuad-core DELL PowerEdge 1950, 2GB
 RAM, 15 x 74GB-FC-10K accesses via 2x2Gbit FC links. Unfortunately the
 links to disks are the bottleneck, so I'm going to use not more than 4
 disks, probably.
 I do know how to tune FreeBSD properly, but I don't know much about
 Solaris tunning. I just upgraded Solaris to:
  SunOS lab14.wheel.pl 5.11 opensol-20070521 i86pc i386 i86pc
 I took upgrades from:
  http://dlc.sun.com/osol/on/downloads/current/
 I believe this is a version with some debugging options turned on. How
 can I turn debug off? Can I or do I need to install something else?
 What other tunnings should I apply?
 
 Don't install from bfu archives instead install a Solaris Express directly 
 from a DVD image.
 
 Or if you do want to use bfu because you really want to match your source 
 code revisions up to a given day then you will need to build the ON 
 consolidation yourself and you 
 an the install the non debug bfu archives (note you will need to download the 
 non debug closed bins to do that).
 
 Easiest way is to just use a DVD install.

Ha, I originally installed from sol-nv-b55b-x86-dvd-iso-[a-e].zip, but
then upgraded to OpenSolaris.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Preparing to compare Solaris/ZFS and FreeBSD/ZFS performance.

2007-05-24 Thread Pawel Jakub Dawidek
On Thu, May 24, 2007 at 01:16:32PM +0200, Claus Guttesen wrote:
 I'm all set for doing performance comparsion between Solaris/ZFS and
 FreeBSD/ZFS. I spend last few weeks on FreeBSD/ZFS optimizations and I
 think I'm ready. The machine is 1xQuad-core DELL PowerEdge 1950, 2GB
 RAM, 15 x 74GB-FC-10K accesses via 2x2Gbit FC links. Unfortunately the
 links to disks are the bottleneck, so I'm going to use not more than 4
 disks, probably.
 
 I do know how to tune FreeBSD properly, but I don't know much about
 Solaris tunning. I just upgraded Solaris to:
 
 I have just (re)installed FreeBSD amd64 current with gcc 4.2 with src
 from May. 21'st on a dual Dell PE 2850.  Does the post-gcc-4-2 current
 include all your zfs-optimizations?
 
 I have commented out INVARIANTS, INVARIANTS_SUPPORT, WITNESS and
 WITNESS_SKIPSPIN in my kernel and recompiled with CPUTYPE=nocona.
 
 A few weeks ago I installed FreeBSD but it panicked when I used
 iozone. So I installed solaris 10 on this box and wanted to keep it
 that way. But solaris lacks FreeBSD ports ;-) so when current upgraded
 gcc to 4.2 I re-installed FreeBSD and the box is so far very stable.
 
 I have imported a 3.9 GB compressed postgresql dump five times to tune
 io-performance, have copied 66 GB of data from another server using
 nfs, installed 117 packages from the ports-collection and it's *very*
 stable.
 
 A default install solaris fares better io-wise compared to a default
 FreeBSD where writes could pass 100 MB/s (zpool iostat 1) and FreeBSD
 would write 30-40 MB/s. After adding the following to
 /boot/loader.conf writes peak at 90-95 MB/s:
 
 vm.kmem_size_max=2147483648
 vfs.zfs.arc_max=1610612736
 
 Now FreeBSD seems to perfom almost as good as solaris io-wise although
 I don't have any numbers to justify my statement. I did not import
 postgresql in solaris as one thing.
 
 Copying the 3.9 GB dump from $HOME to a subdir takes 1 min. 13 secs.
 which is approx. 55 MB/s. Reads peaked at 115 MB/s.
 
 The storage is a atabeast with two raid-controllers connected via two
 qlogic 2300 hba's. Each controller have four raid5-arrays with five
 400 GB disks each.
 
 zetta~#zpool status
  pool: disk1
 state: ONLINE
 scrub: scrub completed with 0 errors on Thu May 24 21:39:46 2007
 config:
 
NAMESTATE READ WRITE CKSUM
disk1   ONLINE   0 0 0
  raidz1ONLINE   0 0 0
da0 ONLINE   0 0 0
da1 ONLINE   0 0 0
da4 ONLINE   0 0 0
da5 ONLINE   0 0 0
  raidz1ONLINE   0 0 0
da2 ONLINE   0 0 0
da3 ONLINE   0 0 0
da6 ONLINE   0 0 0
da7 ONLINE   0 0 0
 
 errors: No known data errors
 
 
 The atabeast is not the fastest storage-provider around but on this
 machine will primarily be a file- and mail-server.
 
 Are there any other tunables on FreeBSD I can look at?

There is probably not much you can do to tune sequential I/O. I'd
suggest starting the investigation by benchmarking the drivers on both
systems, using raw disks (without ZFS).
There are some other things you could try to improve other kinds of
workloads.

To improve concurrency you should use shared locks for VFS lookups:

# sysctl vfs.lookup_shared=1

This patch also improve concurrency in VFS:

http://people.freebsd.org/~pjd/patches/vfs_shared.patch

When you want to operate on mmap(2)ed files, you should disable the ZIL and
remount the file systems (export and re-import the pool):

# sysctl vfs.zfs.zil_disable=1
# zpool export name
# zpool import name

I think the ZIL should be a dataset property, as the differences depending on
the workload are huge. For example, the fsx test is about 15 _times_ faster
when the ZIL is disabled.

There are still some things to optimize, like using UMA for memory
allocations, but we run out of KVA too fast then.

Benchmarking a file system is not easy, as there are other subsystems
involved, like the namecache or the VM. The fsstress test, which mostly
operates on metadata (creates and removes files and directories, renames them,
etc.), is 3 times faster on FreeBSD/ZFS than on Solaris/ZFS, but I believe
that's mostly because of the namecache implementation. The Solaris guys should
seriously look at improving DNLC or replacing it. Another possibility is
VFS, but Solaris VFS is much cleaner, and I somehow don't believe it's
slower. fsx is about 20% faster on FreeBSD; this could be the VM's doing.
Don't take these numbers too seriously - those were only first tries to
see where my port stands, and I was using OpenSolaris for comparison, which
has debugging turned on.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!



[zfs-discuss] Preparing to compare Solaris/ZFS and FreeBSD/ZFS performance.

2007-05-23 Thread Pawel Jakub Dawidek
Hi.

I'm all set to do a performance comparison between Solaris/ZFS and
FreeBSD/ZFS. I spent the last few weeks on FreeBSD/ZFS optimizations and I
think I'm ready. The machine is a 1x quad-core DELL PowerEdge 1950 with 2GB
RAM and 15 x 74GB-FC-10K disks accessed via 2x2Gbit FC links. Unfortunately
the links to the disks are the bottleneck, so I'm probably going to use no
more than 4 disks.

I do know how to tune FreeBSD properly, but I don't know much about
Solaris tuning. I just upgraded Solaris to:

SunOS lab14.wheel.pl 5.11 opensol-20070521 i86pc i386 i86pc

I took upgrades from:

http://dlc.sun.com/osol/on/downloads/current/

I believe this is a version with some debugging options turned on. How
can I turn debugging off? Can I, or do I need to install something else?
What other tunings should I apply?

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS vs UFS2 overhead and may be a bug?

2007-05-04 Thread Pawel Jakub Dawidek
On Thu, May 03, 2007 at 02:15:45PM -0700, Bakul Shah wrote:
 [originally reported for ZFS on FreeBSD but Pawel Jakub Dawid
  says this problem also exists on Solaris hence this email.]

Thanks!

 Summary: on ZFS, overhead for reading a hole seems far worse
 than actual reading from a disk.  Small buffers are used to
 make this overhead more visible.
 
 I ran the following script on both ZFS and UF2 filesystems.
 
 [Note that on FreeBSD cat uses a 4k buffer and md5 uses a 1k
  buffer. On Solaris you can replace them with dd with
  respective buffer sizes for this test and you should see
  similar results.]
 
 $ dd /dev/zero bs=1m count=10240 SPACY# 10G zero bytes allocated
 $ truncate -s 10G HOLEY   # no space allocated
 
 $ time dd SPACY /dev/null bs=1m # A1
 $ time dd HOLEY /dev/null bs=1m # A2
 $ time cat SPACY /dev/null   # B1
 $ time cat HOLEY /dev/null   # B2
 $ time md5 SPACY  # C1
 $ time md5 HOLEY  # C2
 
 I have summarized the results below.
 
 ZFSUFS2
   Elapsed System  Elapsed System Test
 dd SPACY bs=1m  110.26   22.52340.38   19.11  A1
 dd HOLEY bs=1m   22.44   22.41 24.24   24.13  A2
 
 cat SPACY 119.64   33.04  342.77   17.30  B1
 cat HOLEY 222.85  222.08   22.91   22.41  B2
 
 md5 SPACY 210.01   77.46  337.51   25.54  C1  
 md5 HOLEY 856.39  801.21   82.11   28.31  C2

This is what I see on Solaris (hole is 4GB):

# /usr/bin/time dd if=/ufs/hole of=/dev/null bs=128k
real   23.7
# /usr/bin/time dd if=/zfs/hole of=/dev/null bs=128k
real   21.2

# /usr/bin/time dd if=/ufs/hole of=/dev/null bs=4k
real   31.4
# /usr/bin/time dd if=/zfs/hole of=/dev/null bs=4k
real 7:32.2
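
For anyone who wants to reproduce this without dd, here is a minimal C version
of the same test - read a file sequentially with a given buffer size and print
the elapsed time (file names and buffer sizes are whatever you choose):

/* Minimal hole-read test: ./readbuf <file> <bufsize> */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <bufsize>\n", argv[0]);
		return (1);
	}
	size_t bufsize = (size_t)strtoul(argv[2], NULL, 0);
	char *buf = malloc(bufsize);
	int fd = open(argv[1], O_RDONLY);
	if (buf == NULL || fd == -1) {
		perror("setup");
		return (1);
	}

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	ssize_t n;
	while ((n = read(fd, buf, bufsize)) > 0)
		;                       /* discard the data, like >/dev/null */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("read with %zu-byte buffer: %.1f s\n", bufsize, secs);
	close(fd);
	free(buf);
	return (n < 0);
}

Running it as './readbuf /zfs/hole 4096' versus './readbuf /zfs/hole 131072'
should show the same gap as the dd runs above.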

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Re: zfs performance on fuse (Linux) compared to other fs

2007-04-24 Thread Pawel Jakub Dawidek
On Mon, Apr 23, 2007 at 11:42:41PM -0700, Georg-W. Koltermann wrote:
   So, at this point in time that seems pretty
  discouraging for an everyday user, on Linux.
  
  nobody told, that zfs-fuse is ready for an everyday
  user at it`s current state ! ;)
 
 That's what I found out, wanted to share and get other's opinion on.
 
 I did not complain.  I thought it might work, it might not, so I tried.
 
 BTW last night I tried ZFS on FreeBSD 7.  I got a panic when trying to make it
 import my existing pool at first.  [...]

Can I see the panic message and backtrace?

 [...] Then I tried again another way and did get it to 
 recognize it. My simple, non-representative performance measurement was even
 slower than zfs-fuse (something like 4-5 minutes for the find, no apparent 
 caching
 effect), and I had many USB read errors along the way as well.  It looks like
 FBSD 7 with ZFS is even more immature than zfs-fuse at this time.  That's ok, 
 it is a CVS snapshot of FreeBSD CURRENT after all.

First of all, the CURRENT snapshot comes with a kernel that has some heavy
debugging options turned on by default. Turning off WITNESS
should make ZFS work a few times faster. Was find the only test you
tried? Currently I'm using the ported DNLC namecache, but I already have
working code that uses FreeBSD's namecache, and it performs much better
for such a test.

There were a few nits after the import, which are all (or mostly) fixed
at this point, and I've had a huge number of reports from users
that ZFS works very stably on FreeBSD. If you could reproduce the panic
and send me the info I'd be grateful.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] Re: ZFS committed to the FreeBSD base.

2007-04-08 Thread Pawel Jakub Dawidek
On Sun, Apr 08, 2007 at 08:03:11AM +0200, Bruno Damour wrote:
 hello,
 
 After csup, buildworld fails for me in libumem.
 Is this due to zfs import ?
 Or my config ?
 
 Thanks for any clue, i'm dying to try your brand new zfs on amd64 !!
 
 Bruno
 
 FreeBSD vil1.ruomad.net 7.0-CURRENT FreeBSD 7.0-CURRENT #0: Fri Mar 23 
 07:33:56 CET 2007 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/VIL1  amd64
 
 make buildworld:
 
 === cddl/lib/libumem (all)
 cc -O2 -fno-strict-aliasing -pipe -march=nocona 
 -I/usr/src/cddl/lib/libumem/../../../compat/opensolaris/lib/libumem 
 -D_SOLARIS_C_SOURCE  -c /usr/src/cddl/lib/libumem/umem.c
 /usr/src/cddl/lib/libumem/umem.c:197: error: redefinition of 'nofail_cb'
 /usr/src/cddl/lib/libumem/umem.c:30: error: previous definition of 
 'nofail_cb' was here
 /usr/src/cddl/lib/libumem/umem.c:199: error: redefinition of `struct 
 umem_cache'
 /usr/src/cddl/lib/libumem/umem.c:210: error: redefinition of 'umem_alloc'
 /usr/src/cddl/lib/libumem/umem.c:43: error: previous definition of 
 'umem_alloc' was here

Did you use my previous patches? There is no cddl/lib/libumem/umem.c in
HEAD; that was its old location and it was moved to
compat/opensolaris/lib/libumem/. Delete your entire cddl/ directory and
csup again.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS committed to the FreeBSD base.

2007-04-06 Thread Pawel Jakub Dawidek
On Fri, Apr 06, 2007 at 05:54:37AM +0100, Ricardo Correia wrote:
 I'm interested in the cross-platform portability of ZFS pools, so I have
 one question: did you implement the Solaris ZFS whole-disk support
 (specifically, the creation and recognition of the EFI/GPT label)?
 
 Unfortunately some tools in Linux (parted and cfdisk) have trouble
 recognizing the EFI partition created by ZFS/Solaris..

I'm not yet set up to move disks between FreeBSD and Solaris, but my
first goal was to integrate it with FreeBSD's GEOM framework.

We support cache flushing operations on any GEOM provider (disk,
partition, slice, anything disk-like), so basically I currently treat
everything as a whole disk (because I simply can), but don't do any
EFI/GPT labeling. I'll try to move data from a Solaris disk to FreeBSD
and see what happens.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] ZFS committed to the FreeBSD base.

2007-04-06 Thread Pawel Jakub Dawidek
On Fri, Apr 06, 2007 at 01:29:11PM +0200, Pawel Jakub Dawidek wrote:
 On Fri, Apr 06, 2007 at 05:54:37AM +0100, Ricardo Correia wrote:
  I'm interested in the cross-platform portability of ZFS pools, so I have
  one question: did you implement the Solaris ZFS whole-disk support
  (specifically, the creation and recognition of the EFI/GPT label)?
  
  Unfortunately some tools in Linux (parted and cfdisk) have trouble
  recognizing the EFI partition created by ZFS/Solaris..
 
 I'm not yet setup to move disks between FreeBSD and Solaris, but my
 first goal was to integrate it with FreeBSD's GEOM framework.
 
 We support cache flushing operations on any GEOM provider (disk,
 partition, slice, anything disk-like), so bascially currently I treat
 everything as a whole disk (because I simply can), but don't do any
 EFI/GPT labeling. I'll try to move data from Solaris' disk to FreeBSD
 and see what happen.

First try:

GEOM: ad6: corrupt or invalid GPT detected.
GEOM: ad6: GPT rejected -- may not be recoverable.

:)

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] Re: ZFS committed to the FreeBSD base.

2007-04-06 Thread Pawel Jakub Dawidek
On Sat, Apr 07, 2007 at 12:39:14AM +0200, Bruno Damour wrote:
 Thanks, fantasticly interesting !
   Currently ZFS is only compiled as kernel module and is only available
   for i386 architecture. Amd64 should be available very soon, the other
   archs will come later, as we implement needed atomic operations.
   
 I'm waiting eagerly to amd64 version
 
 Missing functionality.
 
   - There is no support for ACLs and extended attributes.
   
 Is this planned ? Does that means I cannot use it as a basis for a 
 full-featured samba share ?

It is planned, but it's not trivial. Does samba support NFSv4-style
ACLs?

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] Something like spare sectors...

2007-04-05 Thread Pawel Jakub Dawidek
Hi.

What do you think about adding functionality similar to a disk's spare
sectors - if a sector dies, a new one is assigned from the spare sector
pool? This would be very helpful, especially for laptops, where you have
only one disk. I simulated returning EIO for one sector from a one-disk
pool and, as you know, the system panicked:

panic: ZFS: I/O failure (write on unknown off 0: zio 0xc436d400 [L0 zvol 
object] 2000L/2000P DVA[0]=0:4000:2000 fletcher2 uncompressed LE contiguous 
birth=11 fill=1 
cksum=90519dcb617667ac:e96316f8a73d7efc:8ca812fc04509f9b:9b9632c6959cbd71): 
error 5

From what I saw, ZFS retried the write to this sector once more before
panicking, but why not just try another block? And maybe remember the
problematic block somewhere. Of course this won't save us when a read
operation fails, but it should work quite well for writes.

I'm not sure how vdev_mirror works exactly, i.e. whether it needs both mirror
components to be identical or whether the only guarantee is that they have the
same data, but not necessarily in the same place. If the latter, the proposed
mechanism could also be used as part of the self-healing process, I
think.
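
For illustration, here is a hypothetical sketch of such a remap table (names
invented, not ZFS code): a write that hits a bad offset is redirected to a
block taken from a reserved spare area, and the translation is remembered so
later accesses find the data:

/* Hypothetical per-vdev bad-block remap table (illustration only). */
#include <stdio.h>
#include <stdint.h>

#define SPARE_SLOTS 16

struct remap {
	uint64_t bad_off;       /* offset that returned EIO on write */
	uint64_t spare_off;     /* replacement block from the spare area */
};

static struct remap table[SPARE_SLOTS];
static int nremaps;
static uint64_t next_spare = 1ULL << 30;        /* reserved spare area */

/* Translate an offset, following an existing remap entry if there is one. */
static uint64_t remap_lookup(uint64_t off)
{
	for (int i = 0; i < nremaps; i++)
		if (table[i].bad_off == off)
			return (table[i].spare_off);
	return (off);
}

/* Called when a write to 'off' failed: allocate a spare and remember it. */
static uint64_t remap_add(uint64_t off)
{
	if (nremaps == SPARE_SLOTS)
		return (off);           /* out of spares, fall back to EIO */
	table[nremaps].bad_off = off;
	table[nremaps].spare_off = next_spare;
	next_spare += 512;
	return (table[nremaps++].spare_off);
}

int main(void)
{
	uint64_t bad = 4096;            /* pretend this sector returned EIO */
	uint64_t redirected = remap_add(bad);
	printf("write to %ju redirected to %ju, lookups now map to %ju\n",
	    (uintmax_t)bad, (uintmax_t)redirected, (uintmax_t)remap_lookup(bad));
	return (0);
}

With COW the lookup side may not even be needed - the block pointer written by
the retried allocation would already reference the new location - but the bad
offset still has to be remembered so the allocator never hands it out again.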

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] ZFS committed to the FreeBSD base.

2007-04-05 Thread Pawel Jakub Dawidek
Hi.

I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system. ZFS is available in the HEAD branch and will be
available in FreeBSD 7.0-RELEASE as an experimental feature.

Commit log:

  Please welcome ZFS - The last word in file systems.
  
  ZFS file system was ported from the OpenSolaris operating system. The code
  is under the CDDL license.
  
  I'd like to thank all SUN developers that created this great piece of
  software.
  
  Supported by: Wheel LTD (http://www.wheel.pl/)
  Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
  Supported by: Sentex (http://www.sentex.net/)

Limitations.

  Currently ZFS is only compiled as kernel module and is only available
  for i386 architecture. Amd64 should be available very soon, the other
  archs will come later, as we implement needed atomic operations.

Missing functionality.

  - We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
iSCSI is also not supported at this point. This should be fixed in
the future, we may also add support for sharing ZVOLs over ggate.
  - There is no support for ACLs and extended attributes.
  - There is no support for booting off of ZFS file system.

Other than that, ZFS should be fully-functional.

Enjoy!

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] User-defined properties.

2007-04-01 Thread Pawel Jakub Dawidek
Hi.

How can a user-defined property be removed? I can't find a way...

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] User-defined properties.

2007-04-01 Thread Pawel Jakub Dawidek
On Sun, Apr 01, 2007 at 12:03:36PM -0700, Eric Schrock wrote:
 You should be able to get rid of it with 'zfs inherit'.  It's not
 exactly intuitive, but it matches the native property behavior.  If you
 have any advice for improving documentation, plese let us know.

Indeed, but I was looking more for something as simple as 'zfs del
property filesystem'. Your method won't work in this situation:

# zfs create tank/foo
# zfs create tank/foo/bar
# zfs set org.freebsd:test=test tank/foo
# zfs get -r org.freebsd:test tank/foo
NAME  PROPERTY  VALUE SOURCE
tank/foo  org.freebsd:test  test  local
tank/foo/bar  org.freebsd:test  test  inherited from 
tank/foo

Now, how do I remove it only from tank/foo/bar? Let's assume I have many
datasets under tank/foo/ - I don't want to remove the property from
tank/foo and add it to each of the other datasets.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] User-defined properties.

2007-04-01 Thread Pawel Jakub Dawidek
On Sun, Apr 01, 2007 at 02:20:29PM -0700, Eric Schrock wrote:
 This can't be done due to the way ZFS property inheritance works in the
 DSL.  You can explicitly set it to the empty string, but you can't unset
 the property alltogether.  This is exactly why the 'zfs get -s local'
 option exists, so you can find only locally-set properties.

Ok, thanks!

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration

2007-03-23 Thread Pawel Jakub Dawidek
On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote:
 Again, thanks to devids, the autoreplace code would not kick in here at
 all.  You would end up with an identical pool.

Eric, maybe I'm missing something, but why does ZFS depend on devids at all?
As I understand it, a devid is something that never changes for a block
device, e.g. a disk serial number, but on the other hand it is optional, so
we can rely on the fact that it's always there (I mean for all block devices
we use).

Why not simply forget about devids and just focus on on-disk metadata
to detect pool components?

The only reason I see is performance. This is probably why
/etc/zfs/zpool.cache is used as well.

In FreeBSD we have the GEOM infrastructure for storage. Each storage
device (disk, partition, mirror, etc.) is simply a GEOM provider. If
a GEOM provider appears (e.g. a disk is inserted or a partition is configured),
all interested parties are informed about it and can 'taste' the
provider by reading the metadata specific to them. The same happens when a
provider goes away - all interested parties are informed and can react
accordingly.

We don't see any performance problems related to the fact that each disk
that appears is read by many GEOM classes.
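
Schematically (invented names - this is not the GEOM or ZFS API, just the shape
of the model):

/* Schematic 'taste on arrival' model; the names are made up. */
#include <stdio.h>
#include <string.h>

struct provider {
	const char *name;
	const char *label;      /* stands in for the on-disk metadata */
};

struct storage_class {
	const char *name;
	/* Returns non-zero if this class recognizes the provider as its own. */
	int (*taste)(const struct provider *);
};

static int zfs_taste(const struct provider *p)
{
	return (strcmp(p->label, "zfs-vdev") == 0);
}

static int mirror_taste(const struct provider *p)
{
	return (strcmp(p->label, "gmirror") == 0);
}

static const struct storage_class classes[] = {
	{ "zfs",     zfs_taste },
	{ "gmirror", mirror_taste },
};

/* Called whenever a new disk, slice or partition shows up. */
static void provider_arrived(const struct provider *p)
{
	for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
		if (classes[i].taste(p))
			printf("%s claims %s based on its on-disk label\n",
			    classes[i].name, p->name);
}

int main(void)
{
	struct provider ad6 = { "ad6", "zfs-vdev" };
	provider_arrived(&ad6);         /* no devid involved, metadata decides */
	return (0);
}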

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration

2007-03-23 Thread Pawel Jakub Dawidek
On Fri, Mar 23, 2007 at 11:31:03AM +0100, Pawel Jakub Dawidek wrote:
 On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote:
  Again, thanks to devids, the autoreplace code would not kick in here at
  all.  You would end up with an identical pool.
 
 Eric, maybe I'm missing something, but why ZFS depend on devids at all?
 As I understand it, devid is something that never change for a block
 device, eg. disk serial number, but on the other hand it is optional, so
 we can rely on the fact it's always there (I mean for all block devices

s/can/can't/

 we use).
 
 Why we simply not forget about devids and just focus on on-disk metadata
 to detect pool components?
 
 The only reason I see is performance. This is probably why
 /etc/zfs/zpool.cache is used as well.
 
 In FreeBSD we have the GEOM infrastructure for storage. Each storage
 device (disk, partition, mirror, etc.) is simply a GEOM provider. If
 GEOM provider appears (eg. disk is inserted, partition is configured)
 all interested parties are informed about this I can 'taste' the
 provider by reading metadata specific for them. The same when provider
 goes away - all interested parties are informed and can react
 accordingly.
 
 We don't see any performance problems related to the fact that each disk
 that appears is read by many GEOM classes.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] suggestion: directory promotion to filesystem

2007-02-21 Thread Pawel Jakub Dawidek
On Wed, Feb 21, 2007 at 10:11:43AM -0800, Matthew Ahrens wrote:
 Adrian Saul wrote:
 Not hard to work around - zfs create and a mv/tar command and it is
 done... some time later.  If there was say  a zfs graft directory
 newfs command, you could just break of the directory as a new
 filesystem and away you go - no copying, no risking cleaning up the
 wrong files etc.
 
 Yep, this idea was previously discussed on this list -- search for zfs 
 split and see the following RFE:
 
 6400399 want zfs split
 
 zfs join was also discussed but I don't think it's especially feasible or 
 useful.

'zfs join' can be hard because of inode number collisions, but it may be
useful. Imagine a situation where you have the following file systems:

/tank
/tank/foo
/tank/bar

and you want to move a huge amount of data from /tank/foo to /tank/bar.
If you use mv/tar/dump it will copy the entire data set. It would be much
faster to 'zfs join tank tank/foo && zfs join tank tank/bar', then just mv the
data and 'zfs split' them back. :)

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: crypto properties (Was: Re: [zfs-discuss] ZFS inode equivalent)

2007-02-02 Thread Pawel Jakub Dawidek
On Fri, Feb 02, 2007 at 08:46:34AM +, Darren J Moffat wrote:
 Pawel Jakub Dawidek wrote:
 On Thu, Feb 01, 2007 at 11:00:07AM +, Darren J Moffat wrote:
 Neil Perrin wrote:
 No it's not the final version or even the latest!
 The current on disk format version is 3. However, it hasn't
 diverged much and the znode/acl stuff hasn't changed.
 and it will get updated as part of zfs-crypto, I just haven't done so yet 
 because I'm not finished designing yet.
 Do you consider adding a new property type (next to readonly and
 inherit) - a oneway property? Such propery could be only set if the
 dataset has no children, no snapshots and no data, and once set can't be
 modified. oneway would be the type of the encryption property.
 On the other hand you may still want to support encryption algorithm
 change and most likely key change.
 
 I'm not sure I understand what you are asking for.

I'm sorry, it seems I started my explanation at too deep a level. I started to
play with encryption on my own by creating a crypto compression
algorithm. Currently there are a few types of properties (readonly,
inherited, etc.), but none of them seems to be suitable for encryption.
When you enable encryption there should either be no data, or you need to know
that the existing data is going to be encrypted and the plaintext data securely
removed automatically. Of course the latter is much more complex to
implement.

 My current plan is that once set the encryption property that describes which 
 algorithm (mechanism actually: algorithm, key length and mode, eg 
 aes-128-ccm) can not be 
 changed, it would be inherited by any clones. Creating new child file systems 
 rooted in an encrypted filesystem you would be allowed to turn if off (I'd 
 like to have a 
 policy like the acl one here) but by default it would be inherited.

Right. I forgot that a dataset created under another dataset doesn't
share data with the parent.

 Key change is a very difficult problem because in some cases it can mean 
 rewritting all previous data, in other cases it just means start using the 
 new key now but keep the 
 old one.   Which is correct depends on why you are doing a key change.  Key 
 change for data at rest is a very different problem space from rekey in a 
 network protocol.

A key change is nice, and the possibility of an algorithm change is also nice
in case the one you use becomes broken.
What I'm doing in geli (my disk encryption software for FreeBSD) is to
use a random, strong master key, which is encrypted by the user's passphrase,
keyfiles, etc. This is nice because changing the user's passphrase doesn't
affect the master key and thus doesn't cost any I/O operations.
Another nice thing about it is that you can have many copies of the
master key protected by different passphrases. For example, two persons
can decrypt your data: you and the security officer in your company.

On the other hand, changing the master key should also be possible.

A good starting point IMHO would be to support changing the user's passphrase
(keyfile, etc.) without touching the master key, and to document
changing the master key, algorithm, key length, etc. via e.g. a local zfs
send/recv.
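
A toy sketch of that scheme (the 'KDF' and the XOR 'cipher' below are
stand-ins for a real KDF and AES - this only illustrates the structure, it is
not real crypto):

/*
 * Toy master-key wrapping, illustration only: one random master key is
 * stored several times, each copy wrapped under a different user secret,
 * so a passphrase change only rewrites one small wrapped copy.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define KEYLEN 16

/* Stand-in KDF: stretch a passphrase into KEYLEN bytes (FNV-1a based). */
static void toy_kdf(const char *pass, uint8_t out[KEYLEN])
{
	uint64_t h = 1469598103934665603ULL;
	for (int i = 0; i < KEYLEN; i++) {
		for (const char *p = pass; *p != '\0'; p++) {
			h ^= (uint8_t)*p;
			h *= 1099511628211ULL;
		}
		h ^= (uint64_t)i;
		h *= 1099511628211ULL;
		out[i] = (uint8_t)(h >> 32);
	}
}

/* Stand-in cipher: XOR with the derived key; applying it twice unwraps. */
static void wrap(const uint8_t in[KEYLEN], const char *pass, uint8_t out[KEYLEN])
{
	uint8_t dk[KEYLEN];

	toy_kdf(pass, dk);
	for (int i = 0; i < KEYLEN; i++)
		out[i] = in[i] ^ dk[i];
}

int main(void)
{
	uint8_t master[KEYLEN], user_slot[KEYLEN], officer_slot[KEYLEN];
	uint8_t recovered[KEYLEN];

	for (int i = 0; i < KEYLEN; i++)        /* would be random in real life */
		master[i] = (uint8_t)(i * 37 + 1);

	wrap(master, "user passphrase", user_slot);         /* slot 0 */
	wrap(master, "security officer", officer_slot);     /* slot 1 */

	/* Unwrapping with the right passphrase recovers the master key... */
	wrap(user_slot, "user passphrase", recovered);
	printf("user slot recovers the master key: %s\n",
	    memcmp(recovered, master, KEYLEN) == 0 ? "yes" : "no");

	/* ...and a passphrase change just re-wraps one small copy. */
	wrap(master, "new passphrase", user_slot);
	wrap(officer_slot, "security officer", recovered);
	printf("officer slot still works after the change: %s\n",
	    memcmp(recovered, master, KEYLEN) == 0 ? "yes" : "no");
	return (0);
}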

 In theory the algorithm could be different per dnode_phys_t just like 
 checksum/compression are today, however having aes-128 on one dnode and 
 aes-256 on another causes a 
 problem because you also need different keys for them, it gets even more 
 complex if you consider the algorithm mode and if you choose completely 
 different algorithms.  
 Having a different algorithm and key length will certainly be possible for 
 different filesystems though (eg root with aes-128 and home with aes-256).

Maybe keys should be pool properties? You add a new key to the pool and
then assign a selected key to given datasets. You can then unlock
the key using zpool(1M), or you'll be asked to unlock all keys used by
a dataset when you want to mount/attach it (file system or zvol). Once
a key is unlocked, the remaining datasets that use the same key can
be mounted/attached automatically. Just a thought...

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] zfs rewrite?

2007-01-28 Thread Pawel Jakub Dawidek
On Fri, Jan 26, 2007 at 06:08:50PM -0800, Darren Dunham wrote:
  What do you guys think about implementing a 'zfs/zpool rewrite' command?
  It'll read every block older than the date when the command was executed
  and write it again (using the standard ZFS COW mechanism, similar to how
  resilvering works, but the data is read from the same disk it is written
  to).
 
 #1 How do you control I/O overhead?

The same way it is handled for scrub and resilver.
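I.e. throttled in the background. Usage could then be as trivial as this
(purely hypothetical, of course):

# zfs rewrite tank/data          <- rewrite blocks older than "now", in the background
# zpool status tank              <- progress reported here, like scrub/resilver progress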

 #2 Snapshot blocks are never rewritten at the moment.  Most of your
suggestions seem to imply working on the live data, but doing that
for snapshots as well might be tricky. 

Good point, see below.

  3. I created a file system with a huge amount of data, where most of the
  data is read-only. I change my server from an Intel to a sparc64 machine.
  Adaptive endianness only changes byte order to native on write, and
  because the file system is mostly read-only, it'll need to byteswap all
  the time. And here comes 'zfs rewrite'!
 
 It's only the metadata that is modified anyway, not the file data.  I
 would hope that this could be done more easily than a full tree rewrite
 (and again the issue with snapshots).  Also, the overhead there probably
 isn't going to be very high (since the metadata will be cached in most
 cases).  

Agreed. Probably in this case there should be a rewrite-only-metadata
mode. I agree the overhead is probably not high, but on the other hand,
I'm quite sure there are workloads which will see the difference, e.g.
'find / -name something'.

 Other than that, I'm guessing something like this will be necessary to
 implement disk evacuation/removal.  If you have to rewrite data from one
 disk to elsewhere in the pool, then rewriting the entire tree shouldn't
 be much harder.

How did I forget about this one? :) That's right. I believe ZFS will gain
such an ability at some point, and rewrite functionality fits very nicely
here: mark the disk/mirror/raid-z as no-more-writes and start the rewrite
process (probably limited to that vdev only). To implement such
functionality there also has to be a way to migrate snapshot data, so
sooner or later there will be a need for moving snapshot blocks.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-13 Thread Pawel Jakub Dawidek
On Mon, Jan 08, 2007 at 11:00:36AM -0600, [EMAIL PROTECTED] wrote:
 I have been looking at zfs source trying to get up to speed on the
 internals.  One thing that interests me about the fs is what appears to be
 a low hanging fruit for block squishing CAS (Content Addressable Storage).
 I think that in addition to lzjb compression, squishing blocks that contain
 the same data would buy a lot of space for administrators working in many
 common workflows.
[...]

I like the idea, but I'd prefer to see it as a per-pool option, not a
per-filesystem one.

I found somewhere in the ZFS documentation that clones are nice to use
for a large number of diskless stations. That's fine, but after every
upgrade more and more files are updated and fewer and fewer blocks are
shared between clones. Having such functionality for the entire pool
would be a nice optimization in this case. Actually, this doesn't have to
be a per-pool option, but a per-filesystem-hierarchy one, i.e. all file
systems under tank/diskless/.

I'm not yet sure, though, how you could quickly build the list of
hash-to-block mappings for large pools at boot...

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Differences between ZFS and UFS.

2006-12-30 Thread Pawel Jakub Dawidek
On Sat, Dec 30, 2006 at 11:28:55AM +0100, [EMAIL PROTECTED] wrote:
 
 Basically ZFS passes all my tests (about 3000). I see one problem with UFS
 and two differences:
 
 That's good; do you have those tests published anywhere.

I'll publish them once I finish with Linux. They already work for
FreeBSD/UFS, FreeBSD/ZFS, Solaris/UFS and Solaris/ZFS.

 1. link(2) manual page states that privileged processes can make
multiple links to a directory. This looks like a general comment, but
it's only true for UFS.
 
 Solaris UFS doesn't deal gracefully with that.  (Fsck will complain and
 fix the fs and two fsck passes are generally needed.)
 
 An argument can be made to ban this for UFS too.
 
 (Some of the other fses do support this, like tmpfs)

Maybe it's just worth mentioning in the manual page which file systems
support this feature.

 2. link(2) in UFS allows to remove directories, but doesn't allow this
in ZFS.
 
 Link with the target being a directory and the source a any file or
 only directories?  And only as superuer?

I'm sorry, I meant unlink(2) here.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] Differences between ZFS and UFS.

2006-12-29 Thread Pawel Jakub Dawidek
Hi.

Here are some things my file system test suite discovered on Solaris ZFS
and UFS.

Basically ZFS passes all my tests (about 3000). I see one problem with UFS
and two differences:

1. link(2) manual page states that privileged processes can make
   multiple links to a directory. This looks like a general comment, but
   it's only true for UFS.

2. link(2) in UFS allows to remove directories, but doesn't allow this
   in ZFS.

3. Unsuccessful link(2) can update file's ctime:

# fstest mkdir foo 0755
# fstest create foo/bar 0644
# fstest chown foo/bar 65534 -1
# ctime1=`fstest stat foo/bar ctime`
# sleep 1
# fstest -u 65534 link foo/bar foo/baz   --- this unsuccessful operation updates ctime
EACCES
# ctime2=`fstest stat foo/bar ctime`
# echo $ctime1 $ctime2
1167440797 1167440798

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Re: [security-discuss] Thoughts on ZFS Secure Delete - without using Crypto

2006-12-20 Thread Pawel Jakub Dawidek
On Tue, Dec 19, 2006 at 02:04:37PM +, Darren J Moffat wrote:
 In case it wasn't clear I am NOT proposing a UI like this:
 
 $ zfs bleach ~/Documents/company-finance.odp
 
 Instead ~/Documents or ~ would be a ZFS file system with a policy set 
 something like this:
 
 # zfs set erase=file:zero
 
 Or maybe more like this:
 
 # zfs create -o erase=file -o erasemethod=zero homepool/darrenm
 
 The goal is the same as the goal for things like compression in ZFS, no 
 application change it is free for the applications.

I like the idea, I really do, but it will be so expensive because of
ZFS' COW model. Not only will file removal or truncation trigger
bleaching, but every single file system modification will... Heh, well,
if the privacy of your data is important enough, you probably don't care
too much about performance. I for one would prefer encryption, which may
turn out to be much faster than bleaching and also more secure.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Re: ZFS goes catatonic when drives go dead?

2006-11-23 Thread Pawel Jakub Dawidek
On Thu, Nov 23, 2006 at 12:09:09PM +0100, Pawel Jakub Dawidek wrote:
 On Wed, Nov 22, 2006 at 03:38:05AM -0800, Peter Eriksson wrote:
  There is nothing in the ZFS FAQ about this. I also fail to see how FMA 
  could make any difference since it seems that ZFS is deadlocking somewhere 
  in the kernel when this happens...
  
  It works if you wrap all the physical devices inside SVM metadevices and 
  use those for your
  ZFS/zpool instead. Ie:
  
  metainit d101 1 1 c1t5d0s0
  metainit d102 1 1 c1t5d1s0
  metainit d103 1 1 c1t5d2s0
   zpool create foo raidz /dev/md/dsk/d101 /dev/md/dsk/d102 /dev/md/dsk/d103
  
  Another unrelated observation - I've noticed that ZFS often works *faster* 
  if I wrap a physical partition inside a metadevice and then feed that to 
  zpool instead of using the raw partition directly with zpool... Example: 
  Testing ZFS on a spare 40GB partition of the boot ATA disk in an Sun Ultra 
  10/440 gives horrible performance numbers. If I wrap that into a simple 
  metadevice and feed to ZFS things work much faster... Ie:
  
  Zpool containing one normal disk partition:
  
  # /bin/time mkfile 1G 1G
  real 2:46.5
  user       0.4
  sys       24.1
  -- 6MB/s (that was actually the best number I got - the worst was 3:03 
  minutes)
  
  Zpool containing one SVM metadevice containing the same disk partition:
  
  #/bin/time mkfile 1G 1G
  real 1:41.6
  user       0.3
  sys       23.3
  -- 10MB/s
  
  (Idle machine in both cases, mkfile rerun a couple of times, with the same 
  results. I removed the 1G file between reruns of course)
 
 It may be because for raw disks ZFS flushes the write cache (via
 DKIOCFLUSHWRITECACHE), which can be an expensive operation and highly
 depends on the disks/controllers used. I doubt it does the same for
 metadevices, but I may be wrong.

Oops, you operate on partitions... I think for partitions ZFS disables
write cache on disks... Anyway, I'll leave the answer to someone more
clueful.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] ZFS patches for FreeBSD.

2006-11-15 Thread Pawel Jakub Dawidek
Just to let you know that the first set of patches for FreeBSD is now
available:

http://lists.freebsd.org/pipermail/freebsd-fs/2006-November/002385.html

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] Re: Porting ZFS file system to FreeBSD.

2006-10-26 Thread Pawel Jakub Dawidek
On Tue, Sep 05, 2006 at 10:49:11AM +0200, Pawel Jakub Dawidek wrote:
 On Tue, Aug 22, 2006 at 12:45:16PM +0200, Pawel Jakub Dawidek wrote:
  Hi.
  
  I started porting the ZFS file system to the FreeBSD operating system.
 [...]
 
 Just a quick note about progress in my work. I needed to slow down a bit,
 but:

Here is another update:

After way too much time spent fighting the buffer cache I finally made
mmap(2)ed reads/writes work and (which is also very important) kept
regular reads/writes working.

Now I'm able to build FreeBSD's kernel and userland with both the sources
and the objects placed on a ZFS file system.

I also tried to crash it with fsx, fsstress and postmark, but no luck,
it works stably.

On the other hand I'm quite sure there are still many problems in the ZPL,
but fixing mmap(2) allows me to move forward.

As a side note, ZVOL seems to be fully functional.

I need to find a way to test the ZIL, so if you guys at Sun have some ZIL
tests, e.g. an uncleanly stopped file system which at mount time will
exercise the entire ZIL replay path, so that we can verify that my port
replays it properly, that would be great.

PS. There is still a lot to do, so please, don't ask me for patches yet.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Re: [zfs-discuss] Proposal: multiple copies of user data

2006-09-27 Thread Pawel Jakub Dawidek
On Tue, Sep 12, 2006 at 03:56:00PM -0700, Matthew Ahrens wrote:
 Matthew Ahrens wrote:
[...]
 Given the overwhelming criticism of this feature, I'm going to shelve it for 
 now.

I'd really like to see this feature. You say ZFS should change our view
on filesystems; I say: be consistent about it.

In the ZFS world we create one big pool out of all our disks and create
filesystems on top of it. This way we don't have to care about resizing
them, etc. But this way we also define redundancy at the pool level for
all our filesystems.

It is quite common that we have data we don't really care about as well
as data we do care about a lot in the same pool. Before ZFS, I'd just
create RAID0 for the former and RAID1 for the latter, but this is not
the ZFS way, right?

My question is: how can I express my intent of defining the redundancy
level based on the importance of my data, while still following the ZFS
way, without the 'copies' feature?
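Just for the record, the kind of usage I have in mind (using the 'copies'
property name from the proposal):

# zpool create tank da0 da1 da2 da3     <- one big striped pool
# zfs create tank/scratch               <- data I don't care about, single copy
# zfs create tank/home
# zfs set copies=2 tank/home            <- data I do care about, two copies spread over the vdevs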

Please reconsider your choice.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] Re: Porting ZFS file system to FreeBSD.

2006-09-05 Thread Pawel Jakub Dawidek
On Tue, Aug 22, 2006 at 12:45:16PM +0200, Pawel Jakub Dawidek wrote:
 Hi.
 
 I started porting the ZFS file system to the FreeBSD operating system.
[...]

Just a quick note about progress in my work. I needed to slow down a bit,
but:

All file system operations seem to work. The only exception is the
operations needed for mmap(2) to work. Basically the file system works
quite stably even under heavy load. I have a problem with two assertions
I'm hitting when running some heavy regression tests.

I've spent a couple of days fighting with snapshots. To be able to
implement them I needed to port GFS (the generic pseudo-filesystem
framework) from Solaris. Now snapshots (and clones) seem to work just fine.

Some other minor bits like zpool import/export, etc. now also work.

File system is not yet marked as MPSAFE (it still operates under the
Giant lock).

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] Porting ZFS file system to FreeBSD.

2006-08-22 Thread Pawel Jakub Dawidek
Hi.

I started porting the ZFS file system to the FreeBSD operating system.

There is a lot to do, but I'm making good progress, I think.

I'm doing my work in those directories:

contrib/opensolaris/ - userland files taken directly from
OpenSolaris (libzfs, zpool, zfs and others)

sys/contrib/opensolaris/ - kernel files taken directly from
OpenSolaris (zfs, taskq, callb and others)

compat/opensolaris/ - compatibility userland layer, so I can
reduce diffs against vendor files

sys/compat/opensolaris/ - compatibility kernel layer, so I can
reduce diffs against vendor files (kmem based on
malloc(9) and uma(9), mutexes based on our sx(9) locks,
condvars based on sx(9) locks and more)

cddl/ - FreeBSD specific makefiles for userland bits

sys/modules/zfs/ - FreeBSD specific makefile for the kernel
module

You can find all those on FreeBSD perforce server:


http://perforce.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/pjd/zfsHIDEDEL=NO

Ok, so where am I?

I ported the userland bits (libzfs, zfs and zpool). I had ztest and
libzpool compiling and working as well, but I left them behind for now
to focus on kernel bits.

I'm building all files (except 2) into zfs.ko (the kernel module).

I created a new VDEV, vdev_geom, which fits into FreeBSD's GEOM
infrastructure, so basically you can use any GEOM provider to build your
ZFS pool. VDEV_GEOM is implemented as a consumers-only GEOM class.

I reimplemented ZVOL to also export storage as a GEOM provider. This
time it is a providers-only GEOM class.

This way one can, for example, create RAID-Z on top of GELI-encrypted
disks, or encrypt a ZFS volume; the stacking order is up to you.
Basically you can already put UFS on ZFS volumes and it behaves really
stably even under heavy load.
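For example (assuming the GELI providers were already initialized with
'geli init'; device names are only placeholders):

# geli attach /dev/ad4 ; geli attach /dev/ad5 ; geli attach /dev/ad6
# zpool create secure raidz /dev/ad4.eli /dev/ad5.eli /dev/ad6.eli   <- RAID-Z on top of encrypted disks

or the other way around:

# zfs create -V 10g tank/vol
# geli init /dev/zvol/tank/vol
# geli attach /dev/zvol/tank/vol
# newfs /dev/zvol/tank/vol.eli                       <- encrypted UFS on top of a ZFS volume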

Currently I'm working on the file system bits (ZPL), which is the hardest
part of the entire ZFS port, because it talks to one of the most complex
parts of the FreeBSD kernel: VFS.

I can already mount ZFS-created file systems (created with the 'zfs
create' command), create files/directories, change
permissions/owner/etc., list directory contents, and perform a few other
minor operations.

Some screenshots:

lcf:root:~# uname -a
FreeBSD lcf 7.0-CURRENT FreeBSD 7.0-CURRENT #74: Tue Aug 22 03:04:01 
UTC 2006 [EMAIL PROTECTED]:/usr/obj/zoo/pjd/lcf/sys/LCF  i386

lcf:root:~# zpool create tank raidz /dev/ad4a /dev/ad6a /dev/ad5a

lcf:root:~# zpool list
NAME     SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
tank    35,8G   11,7M   35,7G     0%  ONLINE  -

lcf:root:~# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad4a    ONLINE       0     0     0
            ad6a    ONLINE       0     0     0
            ad5a    ONLINE       0     0     0

errors: No known data errors

lcf:root:# zfs create -V 10g tank/vol
lcf:root:# newfs /dev/zvol/tank/vol
lcf:root:# mount /dev/zvol/tank/vol /mnt/test

lcf:root:# zfs create tank/fs

lcf:root:~# mount -t zfs,ufs
tank on /tank (zfs, local)
tank/fs on /tank/fs (zfs, local)
/dev/zvol/tank/vol on /mnt/test (ufs, local)

lcf:root:~# df -ht zfs,ufs
Filesystem            Size    Used   Avail Capacity  Mounted on
tank                   13G     34K     13G     0%    /tank
tank/fs                13G     33K     13G     0%    /tank/fs
/dev/zvol/tank/vol    9.7G    4.0K    8.9G     0%    /mnt/test

lcf:root:~# mkdir /tank/fs/foo
lcf:root:~# touch /tank/fs/foo/bar
lcf:root:~# chown root:operator /tank/fs/foo /tank/fs/foo/bar
lcf:root:~# chmod 500 /tank/fs/foo
lcf:root:~# ls -ld /tank/fs/foo /tank/fs/foo/bar
dr-x------  2 root  operator  3 22 sie 05:41 /tank/fs/foo
-rw-r--r--  1 root  operator  0 22 sie 05:42 /tank/fs/foo/bar

The most important missing pieces:
- Most of the ZPL layer.
- Autoconfiguration. I need to implement vdev discovery based on GEOM's
  taste mechanism.
- .zfs/ control directory (entirely commented out for now).
And many more, but hey, this is after 10 days of work.

PS. Please contact me privately if your company would like to donate to the
ZFS effort. Even without sponsorship the work will be finished, but
your contributions will allow me to spend more time working on ZFS.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!



Re: [zfs-discuss] Porting ZFS file system to FreeBSD.

2006-08-22 Thread Pawel Jakub Dawidek
On Tue, Aug 22, 2006 at 12:22:44PM +0100, Dick Davies wrote:
 This is fantastic work!
 
 How long have you been at it?

As I said, 10 days, but this is really far from being finished.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




[zfs-discuss] Re: [fbsd] Porting ZFS file system to FreeBSD.

2006-08-22 Thread Pawel Jakub Dawidek
On Tue, Aug 22, 2006 at 04:30:44PM +0200, Jeremie Le Hen wrote:
 I don't know much about ZFS, but Sun states this is a 128 bits
 filesystem.  How will you handle this in regards to the FreeBSD
 kernel interface that is already struggling to be 64 bits
 compliant ?  (I'm stating this based on this URL [1], but maybe
 it's not fully up-to-date.)

128 bits is not my goal, but I do want all the other goodies:)

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

