Re: [zfs-discuss] Dedup - Does on imply sha256?

2010-08-24 Thread Jeff Bonwick
Correct.

Jeff

On Aug 24, 2010, at 9:45 PM, Peter Taps wrote:

 Folks,
 
 One of the articles on the net says that the following two commands are 
 exactly the same:
 
 # zfs set dedup=on tank
 # zfs set dedup=sha256 tank
 
 Essentially, "on" is just a pseudonym for "sha256" and "verify" is just a 
 pseudonym for "sha256,verify".
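 
 One rough way to see this for yourself (a sketch: 'tank' is a placeholder pool
 name, and it assumes some dedup'd data has already been written) is to look at
 how the dedup table is keyed -- with dedup=on the DDT objects are sha256-based:
 
 # zdb -D tank | grep DDT-sha256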
 
 Can someone please confirm if this is true?
 
 Thank you in advance for your help.
 
 Regards,
 Peter
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] gang blocks at will?

2010-05-26 Thread Jeff Bonwick
You can set metaslab_gang_bang to (say) 8k to force lots of gang block 
allocations.
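
For instance (a sketch, assuming OpenSolaris-era bits where metaslab_gang_bang
is a 64-bit tunable in the zfs kernel module), either on a live system:

# echo 'metaslab_gang_bang/Z 0x2000' | mdb -kw    # 0x2000 = 8192 bytes

or persistently, in /etc/system:

set zfs:metaslab_gang_bang = 0x2000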

Jeff

On May 25, 2010, at 11:42 PM, Andriy Gapon wrote:

 
 I am working on improving some ZFS-related bits in the FreeBSD boot chain.
 At the moment things seem to work mostly fine, except for a case where
 the boot code needs to read gang blocks.  We have some reports from users about
 failures, but unfortunately their pools are not available for testing anymore
 and I cannot reproduce the issue at will.
 I am sure that the (Open)Solaris GRUB version has been properly tested, including
 the above environment.
 Could you please help me with ideas on how to create a pool/filesystem/file that
 would have gang blocks with high probability?
 Perhaps there are some pre-made test pool images available?
 Or some specialized tool?
 
 Thanks a lot!
 -- 
 Andriy Gapon
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Pool import with failed ZIL device now possible ?

2010-02-16 Thread Jeff Bonwick
 People used fastfs for years in specific environments (hopefully 
 understanding the risks), and disabling the ZIL is safer than fastfs. 
 Seems like it would be a useful ZFS dataset parameter.

We agree.  There's an open RFE for this:

6280630 zil synchronicity

No promise on date, but it will bubble to the top eventually.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] compressratio vs. dedupratio

2009-12-13 Thread Jeff Bonwick
It is by design.  The idea is to report the dedup ratio for the data
you've actually attempted to dedup.  To get a 'diluted' dedup ratio
of the sort you describe, just compare the space used by all datasets
to the space allocated in the pool.  For example, on my desktop,
I have a pool called 'builds' with dedup enabled on some datasets:

$ zfs get used builds
NAME    PROPERTY  VALUE  SOURCE
builds  used      81.0G  -
$ zpool get allocated builds
NAME    PROPERTY   VALUE  SOURCE
builds  allocated  47.4G  -

Thus my diluted dedup ratio is 81.0 / 47.4 = 1.71.
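
If you want that diluted number without doing the division by hand, here is a
rough shell sketch (the zpool parsing is positional, and it assumes both values
come back in the same units, as they do above):

used=$(zfs get -H -o value used builds)
alloc=$(zpool get allocated builds | awk 'NR == 2 {print $3}')
echo "$used $alloc" | awk '{ printf("diluted dedup ratio: %.2fx\n", $1 / $2) }'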

Jeff

On Sat, Dec 12, 2009 at 10:06:49PM +, Robert Milkowski wrote:
 Hi,
 
 The compressratio property seems to be a ratio of compression for a 
 given dataset, calculated in such a way that all data in it (compressed or 
 not) is taken into account.
 The dedupratio property, on the other hand, seems to take into 
 account only deduped data in a pool.
 So, for example, if there is already 1TB of data before dedup=on, and then 
 dedup is set to on and 3 small identical files are copied in, the 
 dedupratio will be 3.  IMHO it is misleading, as it suggests that on 
 average a ratio of 3 was achieved in the pool, which is not true.
 
 Is it by design or is it a bug?
 If it is by design, then having another property which would give a 
 ratio of dedup in relation to all data in a pool (deduped or not) would 
 be useful.
 
 
 Example (snv 129):
 
 
 mi...@r600:/rpool/tmp# mkfile 200m file1
 mi...@r600:/rpool/tmp# zpool create -O atime=off test /rpool/tmp/file1
 
 mi...@r600:/rpool/tmp# ls -l /var/adm/messages
 -rw-r--r-- 1 root root 70993 2009-12-12 21:50 /var/adm/messages
 mi...@r600:/rpool/tmp# cp /var/adm/messages /test/
 mi...@r600:/rpool/tmp# sync
 mi...@r600:/rpool/tmp# zfs get compressratio test
 NAME  PROPERTY   VALUE  SOURCE
 test  compressratio  1.00x  -
 
 
 mi...@r600:/rpool/tmp# zfs set compression=gzip test
 mi...@r600:/rpool/tmp# cp /var/adm/messages /test/messages.1
 mi...@r600:/rpool/tmp# sync
 mi...@r600:/rpool/tmp# zfs get compressratio test
 NAME  PROPERTY   VALUE  SOURCE
 test  compressratio  1.27x  -
 
 
 mi...@r600:/rpool/tmp# zfs get compressratio test
 NAME  PROPERTY   VALUE  SOURCE
 test  compressratio  1.24x  -
 
 
 
 
 
 mi...@r600:/rpool/tmp# zpool destroy test
 mi...@r600:/rpool/tmp# zpool create -O atime=off test /rpool/tmp/file1
 mi...@r600:/rpool/tmp# zpool get dedupratio test
 NAME  PROPERTY    VALUE  SOURCE
 test  dedupratio  1.00x  -
 
 
 mi...@r600:/rpool/tmp# cp /var/adm/messages /test/
 mi...@r600:/rpool/tmp# sync
 mi...@r600:/rpool/tmp# zpool get dedupratio test
 NAME  PROPERTY    VALUE  SOURCE
 test  dedupratio  1.00x  -
 
 mi...@r600:/rpool/tmp# cp /var/adm/messages /test/messages.1
 mi...@r600:/rpool/tmp# sync
 mi...@r600:/rpool/tmp# zpool get dedupratio test
 NAME  PROPERTY    VALUE  SOURCE
 test  dedupratio  1.00x  -
 mi...@r600:/rpool/tmp# cp /var/adm/messages /test/messages.2
 mi...@r600:/rpool/tmp# sync
 mi...@r600:/rpool/tmp# zpool get dedupratio test
 NAME  PROPERTY    VALUE  SOURCE
 test  dedupratio  2.00x  -
 
 mi...@r600:/rpool/tmp# rm /test/messages
 mi...@r600:/rpool/tmp# sync
 mi...@r600:/rpool/tmp# zpool get dedupratio test
 NAME  PROPERTY    VALUE  SOURCE
 test  dedupratio  2.00x  -
 
 
 
 
 
 
 
 -- 
 Robert Milkowski
 http://milek.blogspot.com
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Doing ZFS rollback with preserving later created clones/snapshot?

2009-12-11 Thread Jeff Bonwick
Yes, although it's slightly indirect:

- make a clone of the snapshot you want to roll back to
- promote the clone

See 'zfs promote' for details.
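
In command form, a minimal sketch (the snapshot and clone names here are just
placeholders for illustration):

# zfs clone pool/fs@snap1 pool/fs-rollback   # clone the snapshot you want to return to
# zfs promote pool/fs-rollback               # swap the clone/origin relationship

Later snapshots and clones of the original filesystem are preserved; making the
promoted clone your active boot environment is a separate step not covered here.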

Jeff

On Fri, Dec 11, 2009 at 08:37:04AM +0100, Alexander Skwar wrote:
 Hi.
 
 Is it possible on Solaris 10 5/09, to rollback to a ZFS snapshot,
 WITHOUT destroying later created clones or snapshots?
 
 Example:
 
 --($ ~)-- sudo zfs snapshot rpool/r...@01
 
 --($ ~)-- sudo zfs snapshot rpool/r...@02
 
 --($ ~)-- sudo zfs clone rpool/r...@02 rpool/ROOT-02
 
 --($ ~)-- LC_ALL=C sudo zfs rollback rpool/r...@01
 cannot rollback to 'rpool/r...@01': more recent snapshots exist
 use '-r' to force deletion of the following snapshots:
 rpool/r...@02
 
 So it isn't as simple as that. But what needs to be done, to preserve
 rpool/ROOT-02?
 
 Actually, I'm not concerned (that much) with preserving the clone
 rpool/ROOT-02. But I'd like to keep the contents of rpool/ROOT as
 it was when I created the @02 snapshot.
 
 Is the only possible way to create a backup of rpool/r...@02 (eg.
 of the snapshot directory /rpool/ROOT/.zfs/snapshots/02) and then
 restore it later on (eg. backup to tape, backup to some other filesystem
 using zfs send|recv, rsync, tar, ...)?
 
 Thanks a lot,
 
 Alexander
 --
 No Internet censorship in Germany! -- http://zensursula.net
 Lifestream (Twitter, Blog, ...) -- http://alexs77.soup.io/
 Chat (Jabber/Google Talk): a.sk...@gmail.com , AIM: alexws77
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication - deleting the original

2009-12-08 Thread Jeff Bonwick
 i am no pro in zfs, but to my understanding there is no original.

That is correct.  From a semantic perspective, there is no change
in behavior between dedup=off and dedup=on.  Even the accounting
remains the same: each reference to a block is charged to the dataset
making the reference.  The only place you see the effect of dedup
is at the pool level, which can now have more logical than physical
data.  You may also see a difference in performance, which can be
either positive or negative depending on a whole bunch of factors.

At the implementation level, all that's really happening with dedup
is that when you write a block whose contents are identical to an
existing block, instead of allocating new disk space we just increment
a reference count on the existing block.  When you free the block
(from the dataset's perspective), the storage pool decrements the
reference count, but the block remains allocated at the pool level.
When the reference count goes to zero, the storage pool frees the
block for real (returns it to the storage pool's free space map).

But, to reiterate, none of this is visible semantically.  The only
way you can even tell dedup is happening is to observe that the
total space used by all datasets exceeds the space allocated from
the pool -- i.e. that the pool's dedup ratio is greater than 1.0.
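
A small experiment makes the accounting visible (a rough sketch; the pool,
dataset, and file names are placeholders):

# zfs set dedup=on tank/fs
# cp bigfile /tank/fs/a ; sync
# cp /tank/fs/a /tank/fs/b ; sync       # identical contents: existing blocks just gain a reference
# zfs list tank/fs                      # logical 'used' roughly doubles
# zpool get allocated,dedupratio tank   # physical allocation barely moves; dedupratio heads toward 2x
# rm /tank/fs/a ; sync                  # refcounts drop; blocks stay allocated while b still references them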

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] heads-up: dedup=fletcher4,verify was broken

2009-11-23 Thread Jeff Bonwick
And, for the record, this is my fault.  There is an aspect of endianness
that I simply hadn't thought of.  When I have a little more time I will
blog about the whole thing, because there are many useful lessons here.

Thank you, Matt, for all your help with this.  And my apologies to
everyone else for the disruption.

Jeff

On Mon, Nov 23, 2009 at 09:15:48PM -0800, Matthew Ahrens wrote:
 We discovered another, more fundamental problem with 
 dedup=fletcher4,verify. I've just putback the fix for:
 
 6904243 zpool scrub/resilver doesn't work with cross-endian 
 dedup=fletcher4,verify blocks
 
 The same instructions as below apply, but in addition, the 
 dedup=fletcher4,verify functionality has been removed.  We will investigate 
 whether it's possible to fix these issues and re-enable this functionality.
 
 --matt
 
 
 Matthew Ahrens wrote:
 If you did not do "zfs set dedup=fletcher4,verify <fs>" (which is 
 available in build 128 and nightly bits since then), you can ignore this 
 message.
 
 We have changed the on-disk format of the pool when using 
 dedup=fletcher4,verify with the integration of:
 
 6903705 dedup=fletcher4,verify doesn't byteswap correctly, has lots 
 of hash collisions
 
 This is not the default dedup setting; pools that only used "zfs set 
 dedup=on" (or =sha256, or =verify, or =sha256,verify) are unaffected.
 
 Before installing bits with this fix, you will need to destroy any 
 filesystems that have had dedup=fletcher4,verify set on them.  You can 
 preserve your existing data by running:
 
 zfs set dedup=<any other setting> <old fs>
 zfs snapshot -r <old fs>@snap
 zfs create <new fs>
 zfs send -R <old fs>@snap | zfs recv -d <new fs>
 zfs destroy -r <old fs>
 
 Simply changing the setting from dedup=fletcher4,verify to another 
 setting is not sufficient, as this does not modify existing data.
 
 You can verify that your pool isn't using dedup=fletcher4,verify by running
 zdb -D <pool> | grep DDT-fletcher4
 If there are no matches, your pool is not using dedup=fletcher4,verify, 
 and it is safe to install bits with this fix.
 
 Build 128 will be respun to include this fix.
 
 Sorry for the inconvenience,
 
 -- team zfs
 
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] heads-up: dedup=fletcher4,verify was broken

2009-11-23 Thread Jeff Bonwick
Finally, just to be clear, one last point:  the two fixes integrated
today only affect you if you've explicitly set dedup=fletcher4,verify.
To quote Matt:

 This is not the default dedup setting; pools that only used zfs set 
 dedup=on (or =sha256, or =verify, or =sha256,verify) are unaffected.

Jeff

On Mon, Nov 23, 2009 at 09:44:41PM -0800, Jeff Bonwick wrote:
 And, for the record, this is my fault.  There is an aspect of endianness
 that I simply hadn't thought of.  When I have a little more time I will
 blog about the whole thing, because there are many useful lessons here.
 
 Thank you, Matt, for all your help with this.  And my apologies to
 everyone else for the disruption.
 
 Jeff
 
 On Mon, Nov 23, 2009 at 09:15:48PM -0800, Matthew Ahrens wrote:
  We discovered another, more fundamental problem with 
  dedup=fletcher4,verify. I've just putback the fix for:
  
  6904243 zpool scrub/resilver doesn't work with cross-endian 
  dedup=fletcher4,verify blocks
  
  The same instructions as below apply, but in addition, the 
  dedup=fletcher4,verify functionality has been removed.  We will investigate 
  whether it's possible to fix these issues and re-enable this functionality.
  
  --matt
  
  
  Matthew Ahrens wrote:
  If you did not do "zfs set dedup=fletcher4,verify <fs>" (which is 
  available in build 128 and nightly bits since then), you can ignore this 
  message.
  
  We have changed the on-disk format of the pool when using 
  dedup=fletcher4,verify with the integration of:
  
  6903705 dedup=fletcher4,verify doesn't byteswap correctly, has lots 
  of hash collisions
  
  This is not the default dedup setting; pools that only used "zfs set 
  dedup=on" (or =sha256, or =verify, or =sha256,verify) are unaffected.
  
  Before installing bits with this fix, you will need to destroy any 
  filesystems that have had dedup=fletcher4,verify set on them.  You can 
  preserve your existing data by running:
  
  zfs set dedup=<any other setting> <old fs>
  zfs snapshot -r <old fs>@snap
  zfs create <new fs>
  zfs send -R <old fs>@snap | zfs recv -d <new fs>
  zfs destroy -r <old fs>
  
  Simply changing the setting from dedup=fletcher4,verify to another 
  setting is not sufficient, as this does not modify existing data.
  
  You can verify that your pool isn't using dedup=fletcher4,verify by running
  zdb -D <pool> | grep DDT-fletcher4
  If there are no matches, your pool is not using dedup=fletcher4,verify, 
  and it is safe to install bits with this fix.
  
  Build 128 will be respun to include this fix.
  
  Sorry for the inconvenience,
  
  -- team zfs
  
  
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedupe is in

2009-11-02 Thread Jeff Bonwick
 Terrific! Can't wait to read the man pages / blogs about how to use it...

Just posted one:

http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup

Enjoy, and let me know if you have any questions or suggestions for
follow-on posts.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Apple cans ZFS project

2009-10-24 Thread Jeff Bonwick
 Apple can currently just take the ZFS CDDL code and incorporate it  
 (like they did with DTrace), but it may be that they wanted a private  
 license from Sun (with appropriate technical support and  
 indemnification), and the two entities couldn't come to mutually  
 agreeable terms.

I cannot disclose details, but that is the essence of it.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacing a failed drive

2009-06-19 Thread Jeff Bonwick
Yep, you got it.

Jeff

On Fri, Jun 19, 2009 at 04:15:41PM -0700, Simon Breden wrote:
 Hi,
 
 I have a ZFS storage pool consisting of a single RAIDZ2 vdev of 6 drives, and 
 I have a question about replacing a failed drive, should it occur in future.
 
 If a drive fails in this double-parity vdev, then am I correct in saying that 
 I would need to (1) unplug the old drive once I've identified the drive id 
 (c1t0d0 etc), (2) plug in the new drive on the same SATA cable, and (3) issue 
 a 'zpool replace pool_name drive_id' command etc, at which point ZFS will 
 resilver the new drive from the parity data ?
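 
 For what it's worth, spelled out as commands (a rough sketch; the pool and
 device names are placeholders).  When the new disk sits on the same SATA port
 it keeps the same device id, so 'zpool replace' can take a single argument:
 
 # zpool replace tank c1t0d0     # same slot: one device name is enough
 # zpool status tank             # watch the resilver rebuild the new disk from parity/data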
 
 Thanks,
 Simon
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mobo SATA migration to AOC-SAT2-MV8 SATA card

2009-06-19 Thread Jeff Bonwick
Yep, right again.

Jeff

On Fri, Jun 19, 2009 at 04:21:42PM -0700, Simon Breden wrote:
 Hi,
 
 I'm using 6 SATA ports from the motherboard but I've now run out of SATA 
 ports, and so I'm thinking of adding a Supermicro AOC-SAT2-MV8 8-port SATA 
 controller card.
 
 What is the procedure for migrating the drives to this card?
 Is it a simple case of (1) issuing a 'zpool export pool_name' command, (2) 
 shutdown, (3) insert card and move all SATA cables for drives from mobo to 
 card, (4) boot and issue a 'zpool import pool_name' command ?
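 
 Spelled out as commands (a sketch; 'tank' is a placeholder pool name):
 
 # zpool export tank
   (power off, move the SATA cables from the motherboard ports to the card, power on)
 # zpool import            # with no argument, lists the pools visible on the new controller
 # zpool import tank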
 
 Thanks,
 Simon
 
 http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resilver Performance and Behavior

2009-05-03 Thread Jeff Bonwick
 According to the ZFS documentation, a resilver operation
 includes what is effectively a dirty region log (DRL) so that if the
 resilver is interrupted, by a snapshot or reboot, the resilver can
 continue where it left off.
 
 That is not the case.  The dirty region log keeps track of what time 
 periods a device was offline, so that if a device goes offline and comes 
 back soon thereafter, only the recent data needs to be resilvered.

And for that reason we call it the Dirty Time Log (DTL) rather than DRL.
This is efficient because actual device outages are temporal, not spatial.
As a rule, a 5-minute outage can be fully resilvered in 5 minutes or less.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Peculiarities of COW over COW?

2009-04-27 Thread Jeff Bonwick
 ZFS blocksize is dynamic, power of 2, with a max size == recordsize.

Minor clarification: recordsize is restricted to powers of 2, but
blocksize is not -- it can be any multiple of sector size (512 bytes).
For small files, this matters: a 37k file is stored in a 37k block.
For larger, multi-block files, the size of each block is indeed a
power of 2 (simplifies the math a bit).
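
You can see the small-file case from userland with a rough sketch like this
(hypothetical dataset with the default 128K recordsize; du output is
approximate because it also counts metadata and sector rounding):

# dd if=/dev/urandom of=/tank/fs/small bs=1k count=37   # random data so compression, if enabled, can't shrink it
# sync; du -k /tank/fs/small                            # roughly 37-40K allocated, not 128K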

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data size grew.. with compression on

2009-04-08 Thread Jeff Bonwick
  Yes, I made note of that in my OP on this thread.  But is it enough to
  end up with 8gb of non-compressed files measuring 8gb on
  reiserfs(linux) and the same data showing nearly 9gb when copied to a
  zfs filesystem with compression on.  
 
 whoops... a hefty exaggeration -- it only shows about a 16MB difference.
 But still, since the zfs side is compressed, that seems like quite a lot.

That's because ZFS reports *all* space consumed by a file, including
all metadata (dnodes, indirect blocks, etc).  For an 8G file stored
in 128K blocks, there are 8G / 128K = 64K block pointers, each of
which is 128 bytes, and is two-way replicated (via ditto blocks),
for a total of 64K * 128 * 2 = 16M.  So this is exactly as expected.
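
The same arithmetic as a one-liner, if you want to sanity-check it:

$ echo '8 * 2^30 / (128 * 2^10) * 128 * 2' | bc    # 16777216 bytes, i.e. 16M of block pointers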

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data size grew.. with compression on

2009-03-30 Thread Jeff Bonwick
Right.

Another difference to be aware of is that ZFS reports the total
space consumed, including space for metadata -- typically around 1%.
Traditional filesystems like ufs and ext2 preallocate metadata and
don't count it as using space.  I don't know how reiserfs does its
bookkeeping, but I wouldn't be surprised if it followed that model.

Jeff

On Mon, Mar 30, 2009 at 02:57:31PM -0400, Brad Plecs wrote:
 
 I've run into this too... I believe the issue is that the block
 size/allocation unit size in ZFS is much larger than the default size
 on older filesystems (ufs, ext2, ext3).
 
 The result is that if you have lots of small files smaller than the
 block size, they take up more total space on the filesystem because
 they occupy at least the block size amount.
 
 See the 'recordsize' ZFS filesystem property, though re-reading the
 man pages, I'm not 100% sure that tuning this property will have the
 intended effect.
 
 BP 
 
 
  I rsynced an 11gb pile of data from a remote linux machine to a zfs
  filesystem with compression turned on.
  
  The data appears to have grown in size rather than been compressed.
  
  Many, even most of the files are formats that are already compressed,
  such as mpg jpg avi and several others.  But also many text files
  (*.html) are in there.  So didn't expect much compression but also
  didn't expect the size to grow.
  
  I realize these are different filesystems that may report
  differently.  Reiserfs on the linux machine and zfs on osol.
  
  in bytes:
  
   Osol: 11542196307
  linux: 11525114469
  ==================
            17081838
  
  Or (If I got the math right) about  16.29 MB bigger on the zfs side
  with compression on.
  
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 -- 
 bpl...@cs.umd.edu
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: creating multiple clones in one zfs(1) call and one txg

2009-03-29 Thread Jeff Bonwick
I agree with Chris -- I'd much rather do something like:

zfs clone snap1 clone1 snap2 clone2 snap3 clone3 ...

than introduce a pattern grammar.  Supporting multiple snap/clone pairs
on the command line allows you to do just about anything atomically.

Jeff

On Fri, Mar 27, 2009 at 10:46:33AM -0500, Chris Kirby wrote:
 On Mar 27, 2009, at 10:33 AM, Darren J Moffat wrote:
  a) that is probably what is wanted most of the time anyway
  b) it is easy to pass from userland to kernel - you pass the
 rules (after some userland sanity checking first) as is.
 
 
 But doesn't that also exclude the possibility of creating non-pattern  
 based
 clones in a single txg?
 
 While I think that allowing multiple clones to be created in a single  
 txg
 is perfectly reasonable, we shouldn't need to artificially restrict the
 clone namespace in order to achieve that.
 
 -Chris
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Forensics related ZFS questions

2009-03-16 Thread Jeff Bonwick
1. Does variable FSB block sizing extend to files larger than record size,
concerning the last FSB allocated?
 
In other words, for files larger than 128KB, that utilize more than one
full recordsize FSB, will the LAST FSB allocated be `right-sized' to fit
the remaining data, or will ZFS allocate a full recordsize FSB for the
last `chunk' of the file?

The last block is currently a multiple of the recordsize, but we intend
to fix this.  There are two options: one, to treat the last block as a
special case; the other, to handle it automatically via compression.
The former is a little more work, but has the advantage of reducing
the file's in-memory footprint as well as its on-disk footprint.

2. Can a developer confirm that COW occurs at the FSB level (vs. sector
level, for example)?
 
In other words, when a single FSB (say 64KB file w/ recordsize=128KB) file
is modified, and it's only one sector within that file that's modified, is
it correct that what's copied-on-write is the entire 64KB FSB allocated to
that file?  (This is a data recovery issue.)

Yes, that's correct.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-11 Thread Jeff Bonwick
 I'm rather tired of hearing this mantra.
 [...]
 Every file system needs a repair utility

Hey, wait a minute -- that's a mantra too!

I don't think there's actually any substantive disagreement here -- stating
that one doesn't need a separate program called /usr/sbin/fsck is not the
same as saying that filesystems don't need error detection and recovery.
There's quite a bit of that in the current code, and more in the works.
Like performance, it is never really done -- we can always do better.

 I've described before a number of checks which ZFS could perform [...]

Well, ZFS is open source.  I would love to see your passion for this topic
directed at the source code.  Seriously.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-11 Thread Jeff Bonwick
  This is CR 6667683
  http://bugs.opensolaris.org/view_bug.do?bug_id=6667683
 
 I think that would solve 99% of ZFS corruption problems!

Based on the reports I've seen to date, I think you're right.

 Is there any EDT for this patch?

Well, because of this thread, this has gone from "on my list" to
"I'm currently working on it."  And I'd like to take a moment to
thank everyone who's weighed in, because it really does make a
difference in setting priorities.

As for a date, I would estimate weeks, not months.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does your device honor write barriers?

2009-02-10 Thread Jeff Bonwick
 well... if you want a write barrier, you can issue a flush-cache and
 wait for a reply before releasing writes behind the barrier.  You will
 get what you want by doing this for certain.

Not if the disk drive just *ignores* barrier and flush-cache commands
and returns success.  Some consumer drives really do exactly that.
That's the issue that people are asking ZFS to work around.

But it's important to understand that this failure mode (silently
ignoring SCSI commands) is truly a case of broken-by-design hardware.
If a disk doesn't honor these commands, then no synchronous operation
is ever truly synchronous -- it'd be like your OS just ignoring O_SYNC.
This means you can't use such disks for (say) a database or NFS server,
because it is *impossible* to know when the data is on stable storage.

If it were possible to detect such disks, I'd add code to ZFS that
would simply refuse to use them.  Unfortunately, there is no reliable
way to test the functioning of synchronize-cache programmatically.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS: unreliable for professional usage?

2009-02-09 Thread Jeff Bonwick
 There is no substitute for cord-yank tests - many and often. The  
 weird part is, the ZFS design team simulated millions of them.
 So the full explanation remains to be uncovered?

We simulated power failure; we did not simulate disks that simply
blow off write ordering.  Any disk that you'd ever deploy in an
enterprise or storage appliance context gets this right.

The good news is that ZFS is getting popular enough on consumer-grade
hardware.  The bad news is that said hardware has a different set of
failure modes, so it takes a bit of work to become resilient to them.
This is pretty high on my short list.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snapshot identity

2009-02-03 Thread Jeff Bonwick
 The Validated Execution project is investigating how to utilize ZFS
 snapshots as the basis of a validated filesystem.  Given that the
 blocks of the dataset form a Merkle tree of hashes, it seemed
 straightforward to validate the individual objects in the snapshot and
 then sign the hash of the root as a means of indicating that the
 contents of the dataset were validated.

Yep, that would work.

 Unfortunately, the block hashes are used to assure the integrity of the
 physical representation of the dataset.  Those hash values can be
 updated during scrub operations, or even during data error recovery,
 while the logical content of the dataset remains intact.

Actually, that's not true -- at least not today.  Once you've taken a
snapshot, the content will never change.  Scrub, resilver, and self-heal
operations repair damaged copies of data, but they don't alter the
data itself, and therefore don't alter its checksum.

This will change when we add support for block rewrite, which will
allow us to do things like migrate data from one device to another,
or to recompress existing data, which *will* affect the checksum.

You may be able to tolerate this by simply precluding it, if you're
targeting a restricted environment.  For example, do you need this
feature for anything other than the root pool?

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS core contributor nominations

2009-02-02 Thread Jeff Bonwick
 I would like to nominate roch.bourbonn...@sun.com for his work on
 improving the performance of ZFS over the last few years.

Absolutely.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Where does set the value to zio-io_offset?

2009-01-24 Thread Jeff Bonwick
Each ZFS block pointer contains up to three DVAs (data virtual addresses),
to implement 'ditto blocks' (multiple copies of the data, above and beyond
any replication provided by mirroring or RAID-Z).  Semantically, ditto blocks
are a lot like mirrors, so we actually use the mirror code to read them.
We do this even in the degenerate single-copy case because it makes a
bunch of other simplifications possible.

Each DVA contains a vdev and offset, which are extracted by DVA_GET_VDEV()
and DVA_GET_OFFSET() for each DVA in vdev_mirror_map_alloc(), and stored
in the mirror map's mc_vd and mc_offset fields.  We then pass these values
to zio_vdev_child_io(), which zio_create()s a dependent child zio to read
or write the data.
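
Paraphrased (treat this as an illustrative sketch of that era's vdev_mirror.c,
not verbatim source), the per-DVA setup looks roughly like:

	for (c = 0; c < BP_GET_NDVAS(zio->io_bp); c++) {
		mirror_child_t *mc = &mm->mm_child[c];
		const dva_t *dva = &zio->io_bp->blk_dva[c];

		/* translate the DVA into a top-level vdev and a byte offset on it */
		mc->mc_vd = vdev_lookup_top(zio->io_spa, DVA_GET_VDEV(dva));
		mc->mc_offset = DVA_GET_OFFSET(dva);
	}

	/*
	 * Later, zio_vdev_child_io(zio, zio->io_bp, mc->mc_vd, mc->mc_offset, ...)
	 * creates the dependent child zio that actually reads or writes that copy.
	 */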

Jeff

On Fri, Jan 23, 2009 at 10:53:35PM -0800, Jin wrote:
 Assume a disk write is started; vdev_disk_io_start will be called 
 from zio_execute. 
 
 static int vdev_disk_io_start(zio_t *zio)
 {
   ..
   bp->b_lblkno = lbtodb(zio->io_offset);
   ..
 } 
 
 After scanning the zfs source, I find that zio->io_offset is only 
 assigned in zio_create, via the offset parameter. 
 
 zio_write calls zio_create with the value 0 for the offset parameter. I can't 
 find anywhere else that zio->io_offset is set. 
 
 After the new block is born, the correct offset has been filled in to bp->blk_dva 
 (see metaslab_alloc), so when and where is the correct value set in zio->io_offset?
 
 Who can tell me?
 
 thanks
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-13 Thread Jeff Bonwick
 Off the top of my head nearly all of them.  Some of them have artificial
 limitations because they learned the hard way that if you give customers
 enough rope they'll hang themselves.  For instance unlimited snapshots.

Oh, that's precious!  It's not an arbitrary limit, it's a safety feature!

 Outside of that... I don't see ANYTHING in your list they didn't do first.

Then you don't know ANYTHING about either platform.  Constant-time
snapshots, for example.  ZFS has them;  NetApp's are O(N), where N is
the total number of blocks, because that's how big their bitmaps are.
If you think O(1) is not a revolutionary improvement over O(N),
then not only do you not know much about either snapshot algorithm,
you don't know much about computing.

Sorry, everyone else, for feeding the troll.  Chum the water all you like,
I'm done with this thread.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpol mirror creation after non-mirrored zpool is setup

2008-12-13 Thread Jeff Bonwick
On Sat, Dec 13, 2008 at 04:44:10PM -0800, Mark Dornfeld wrote:
 I have installed Solaris 10 on a ZFS filesystem that is not mirrored. Since I 
 have an identical disk in the machine, I'd like to add that disk to the 
 existing pool as a mirror. Can this be done, and if so, how do I do it?

Yes:

# zpool attach poolname old_disk new_disk
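
With hypothetical device names, and one aside not in the original mail: since
this is a root pool, you will also want boot blocks on the second disk once the
resilver completes (installgrub on x86, installboot on SPARC):

# zpool attach rpool c1t0d0s0 c1t1d0s0
# zpool status rpool                    # wait for the resilver to finish
# installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0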

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Split responsibility for data with ZFS

2008-12-12 Thread Jeff Bonwick
 I'm going to pitch in here as devil's advocate and say this is hardly
 revolution.  99% of what zfs is attempting to do is something NetApp and
 WAFL have been doing for 15 years+.  Regardless of the merits of their
 patents and prior art, etc., this is not something revolutionarily new.  It
 may be revolution in the sense that it's the first time it's come to open
 source software and been given away, but it's hardly revolutionary in file
 systems as a whole.

99% of what ZFS is attempting to do?  Hmm, OK -- let's make a list:

end-to-end checksums
unlimited snapshots and clones
O(1) snapshot creation
O(delta) snapshot deletion
O(delta) incremental generation
transactionally safe RAID without NVRAM
variable blocksize
block-level compression
dynamic striping
intelligent prefetch with automatic length and stride detection
ditto blocks to increase metadata replication
delegated administration
scalability to many cores
scalability to huge datasets
hybrid storage pools (flash/disk mix) that optimize price/performance

How many of those does NetApp have?  I believe the correct answer is 0%.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression

2008-11-29 Thread Jeff Bonwick
 If you have more comments, or especially if you think I reached the wrong
 conclusion, please do post it.  I will post my continuing results.

I think your conclusions are correct.  The main thing you're seeing is
the combination of gzip-9 being incredibly CPU-intensive with our I/O
pipeline allowing too much of it to be scheduled in parallel.  The latter
is a bug we will fix; the former is the nature of the gzip algorithm.

One other thing you may encounter from time to time is slowdowns due to
kernel VA fragmentation.  The CPU you're using is 32-bit, so you're
running a 32-bit kernel, which has very little KVA.  This tends to be
more of a problem with big-memory machines, however -- e.g. a system
with 8GB running a 32-bit kernel.  With 768MB, you'll probably be OK,
but it's something to be aware of on any 32-bit system.  You can tell
if this is affecting you by looking for kernel threads stuck waiting
to allocate a virtual address:

# echo '::walk thread | ::findstack -v' | mdb -k | grep vmem_xalloc

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Smashing Baby a fake???

2008-11-25 Thread Jeff Bonwick
I think we (the ZFS team) all generally agree with you.  The current
nevada code is much better at handling device failures than it was
just a few months ago.  And there are additional changes that were
made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
product line that will make things even better once the FishWorks team
has a chance to catch its breath and integrate those changes into nevada.
And then we've got further improvements in the pipeline.

The reason this is all so much harder than it sounds is that we're
trying to provide increasingly optimal behavior given a collection of
devices whose failure modes are largely ill-defined.  (Is the disk
dead or just slow?  Gone or just temporarily disconnected?  Does this
burst of bad sectors indicate catastrophic failure, or just localized
media errors?)  The disks' SMART data is notoriously unreliable, BTW.
So there's a lot of work underway to model the physical topology of
the hardware, gather telemetry from the devices, the enclosures,
the environmental sensors etc, so that we can generate an accurate
FMA fault diagnosis and then tell ZFS to take appropriate action.

We have some of this today; it's just a lot of work to complete it.

Oh, and regarding the original post -- as several readers correctly
surmised, we weren't faking anything, we just didn't want to wait
for all the device timeouts.  Because the disks were on USB, which
is a hotplug-capable bus, unplugging the dead disk generated an
interrupt that bypassed the timeout.  We could have waited it out,
but 60 seconds is an eternity on stage.

Jeff

On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
 But that's exactly the problem Richard:  AFAIK.
 
 Can you state that absolutely, categorically, there is no failure mode out 
 there (caused by hardware faults, or bad drivers) that won't lock a drive up 
 for hours?  You can't, obviously, which is why we keep saying that ZFS should 
 have this kind of timeout feature.
 
 For once I agree with Miles, I think he's written a really good writeup of 
 the problem here.  My simple view on it would be this:
 
 Drives are only aware of themselves as an individual entity.  Their job is to 
 save & restore data to themselves, and drivers are written to minimise any 
 chance of data loss.  So when a drive starts to fail, it makes complete sense 
 for the driver and hardware to be very, very thorough about trying to read or 
 write that data, and to only fail as a last resort.
 
 I'm not at all surprised that drives take 30 seconds to timeout, nor that 
 they could slow a pool for hours.  That's their job.  They know nothing else 
 about the storage, they just have to do their level best to do as they're 
 told, and will only fail if they absolutely can't store the data.
 
 The raid controller on the other hand (Netapp / ZFS, etc) knows all about the 
 pool.  It knows if you have half a dozen good drives online, it knows if 
 there are hot spares available, and it *should* also know how quickly the 
 drives under its care usually respond to requests.
 
 ZFS is perfectly placed to spot when a drive is starting to fail, and to take 
 the appropriate action to safeguard your data.  It has far more information 
 available than a single drive ever will, and should be designed accordingly.
 
 Expecting the firmware and drivers of individual drives to control the 
 failure modes of your redundant pool is just crazy imo.  You're throwing away 
 some of the biggest benefits of using multiple drives in the first place.
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lost Disk Space

2008-11-02 Thread Jeff Bonwick
Are you running this on a live pool?  If so, zdb can't get a reliable 
block count -- and zdb -L [live pool] emits a warning to that effect.
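
If you want the block accounting to be trustworthy, one approach (hedged -- it
assumes your zdb supports -e for exported pools and that you can afford the
downtime) is to quiesce the pool first:

# zpool export backup
# zdb -e -bb backup       # block statistics against a pool that isn't changing
# zpool import backup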

Jeff

On Thu, Oct 16, 2008 at 03:36:25AM -0700, Ben Rockwood wrote:
 I've been struggling to fully understand why disk space seems to vanish.  
 I've dug through bits of code and reviewed all the mails on the subject that 
 I can find, but I still don't have a proper understanding of whats going on.  
 
 I did a test with a local zpool on snv_97... zfs list, zpool list, and zdb 
 all seem to disagree on how much space is available.  In this case it's only a 
 discrepancy of about 20G or so, but I've got Thumpers that have a discrepancy 
 of over 6TB!
 
 Can someone give a really detailed explanation about whats going on?
 
 block traversal size 670225837056 != alloc 720394438144 (leaked 50168601088)
 
 bp count:          15182232
 bp logical:    672332631040    avg:  44284
 bp physical:   669020836352    avg:  44066    compression:  1.00
 bp allocated:  670225837056    avg:  44145    compression:  1.00
 SPA allocated: 720394438144    used: 96.40%
 
 Blocks   LSIZE   PSIZE   ASIZE     avg    comp  %Total  Type
     12    120K   26.5K   79.5K   6.62K    4.53    0.00  deferred free
      1     512     512   1.50K   1.50K    1.00    0.00  object directory
      3   1.50K   1.50K   4.50K   1.50K    1.00    0.00  object array
      1     16K   1.50K   4.50K   4.50K   10.67    0.00  packed nvlist
      -       -       -       -       -       -       -  packed nvlist size
     72   8.45M    889K   2.60M   37.0K    9.74    0.00  bplist
      -       -       -       -       -       -       -  bplist header
      -       -       -       -       -       -       -  SPA space map header
    974   4.48M   2.65M   7.94M   8.34K    1.70    0.00  SPA space map
      -       -       -       -       -       -       -  ZIL intent log
  96.7K   1.51G    389M    777M   8.04K    3.98    0.12  DMU dnode
     17   17.0K   8.50K   17.5K   1.03K    2.00    0.00  DMU objset
      -       -       -       -       -       -       -  DSL directory
     13   6.50K   6.50K   19.5K   1.50K    1.00    0.00  DSL directory child map
     12   6.00K   6.00K   18.0K   1.50K    1.00    0.00  DSL dataset snap map
     14   38.0K   10.0K   30.0K   2.14K    3.80    0.00  DSL props
      -       -       -       -       -       -       -  DSL dataset
      -       -       -       -       -       -       -  ZFS znode
      2      1K      1K      2K      1K    1.00    0.00  ZFS V0 ACL
  5.81M    558G    557G    557G   95.8K    1.00   89.27  ZFS plain file
   382K    301M    200M    401M   1.05K    1.50    0.06  ZFS directory
      9   4.50K   4.50K   9.00K      1K    1.00    0.00  ZFS master node
     12    482K   20.0K   40.0K   3.33K   24.10    0.00  ZFS delete queue
  8.20M   66.1G   65.4G   65.8G   8.03K    1.01   10.54  zvol object
      1     512     512      1K      1K    1.00    0.00  zvol prop
      -       -       -       -       -       -       -  other uint8[]
      -       -       -       -       -       -       -  other uint64[]
      -       -       -       -       -       -       -  other ZAP
      -       -       -       -       -       -       -  persistent error log
      1    128K   10.5K   31.5K   31.5K   12.19    0.00  SPA history
      -       -       -       -       -       -       -  SPA history offsets
      -       -       -       -       -       -       -  Pool properties
      -       -       -       -       -       -       -  DSL permissions
      -       -       -       -       -       -       -  ZFS ACL
      -       -       -       -       -       -       -  ZFS SYSACL
      -       -       -       -       -       -       -  FUID table
      -       -       -       -       -       -       -  FUID table size
      5   3.00K   2.50K   7.50K   1.50K    1.20    0.00  DSL dataset next clones
      -       -       -       -       -       -       -  scrub work queue
  14.5M    626G    623G    624G   43.1K    1.00  100.00  Total
 
 
 real    21m16.862s
 user    0m36.984s
 sys     0m5.757s
 
 ===
 Looking at the data:
 [EMAIL PROTECTED] ~$ zfs list backup && zpool list backup
 NAME USED  AVAIL  REFER  MOUNTPOINT
 backup   685G   237K27K  /backup
 NAME SIZE   USED  AVAILCAP  HEALTH  ALTROOT
 backup   696G   671G  25.1G96%  ONLINE  -
 
 So zdb says 626GB is used, zfs list says 685GB is used, and zpool list says 
 671GB is used.  The pool was filled to 100% capacity via dd -- this is 
 confirmed, I can't write data -- yet zpool list says it's only 96%. 
 
 benr.
 --
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org

Re: [zfs-discuss] questions about replacing a raidz2 vdev disk with a larger one

2008-10-11 Thread Jeff Bonwick
ZFS will allow the replacement.  The available size will, however,
be determined by the smallest of the lot.  Once you've replaced
*all* 500GB disks with 1TB disks, the available space will double.

One suggestion: replace as many disks as you intend to at the same time,
so that ZFS only has to do one resilver operation.  It's faster that way.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] questions about replacing a raidz2 vdev disk with a larger one

2008-10-11 Thread Jeff Bonwick
Actually, you can replace them all at once, as long as you don't unplug
the old ones first.  Let's say you have a raidz2 setup like this:

mypool
raidz2
a
b
c
d

and you say this:

# zpool replace mypool a A
# zpool replace mypool b B
# zpool replace mypool c C
# zpool replace mypool d D

Your pool configuration will then become:

mypool
raidz2
replacing
a
A
replacing
b
B
replacing
c
C
replacing
d
D

The original drives (a, b, c, d) will remain in the pool until the
new drives (A, B, C, D) have all the data, at which point the old
drives will be detached and the final pool configuration will be:

mypool
raidz2
A
B
C
D

This assumes, of course, that you have enough slots to plug them all in.
If you're slot-limited -- i.e. you can't add a new drive without pulling
an old one -- then Eric is right, and in fact I'd go further: in that
case, replace only one at a time so you maintain the ability to survive
a disk failing while you're doing all this.

Jeff

On Sat, Oct 11, 2008 at 06:37:17PM -0700, Erik Trimble wrote:
 Jeff Bonwick wrote:
 One suggestion: replace as many disks as you intend to at the same time,
 so that ZFS only has to do one resilver operation.  It's faster that way.
 
 Jeff
   
 Just to be more clear on this: 
 
 Assuming you have data you care about on the current raidz2 zpool, you 
 should replace UP TO [2] drives at once.  That way, you minimize 
 re-silver times, while keeping all your data intact.  If you replace 
 more than 2 at once, you'll destroy the array's redundancy, and have to 
 restore the data from backup.  If you replace one at a time, you'll have 
 to wait for each to resilver before replacing anymore.
 
 If you don't care about the data, then, just destroy the zpool, replace 
 the drives, and recreate the zpool from scratch. It's faster and easier 
 than waiting for the resilvers.
 
 
 -- 
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Jeff Bonwick
 The circumstances where I have lost data have been when ZFS has not
 handled a layer of redundancy.  However, I am not terribly optimistic
 of the prospects of ZFS on any device that hasn't committed writes
 that ZFS thinks are committed.

FYI, I'm working on a workaround for broken devices.  As you note,
some disks flat-out lie: you issue the synchronize-cache command,
they say "got it, boss", yet the data is still not on stable storage.
Why do they do this?  Because it performs better.  Well, duh --
you can make stuff *really* fast if it doesn't have to be correct.

Before I explain how ZFS can fix this, I need to get something off my
chest: people who knowingly make such disks should be in federal prison.
It is *fraud* to win benchmarks this way.  Doing so causes real harm
to real people.  Same goes for NFS implementations that ignore sync.
We have specifications for a reason.  People assume that you honor them,
and build higher-level systems on top of them.  Change the mass of
the proton by a few percent, and the stars explode.  It is impossible
to build a functioning civil society in a culture that tolerates lies.
We need a little more Code of Hammurabi in the storage industry.

Now:

The uberblock ring buffer in ZFS gives us a way to cope with this,
as long as we don't reuse freed blocks for a few transaction groups.
The basic idea: if we can't read the pool starting from the most
recent uberblock, then we should be able to use the one before it,
or the one before that, etc, as long as we haven't yet reused any
blocks that were freed in those earlier txgs.  This allows us to
use the normal load on the pool, plus the passage of time, as a
displacement flush for disk caches that ignore the sync command.

If we go back far enough in (txg) time, we will eventually find an
uberblock all of whose dependent data blocks have made it to disk.
I'll run tests with known-broken disks to determine how far back we
need to go in practice -- I'll bet one txg is almost always enough.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Jeff Bonwick
 Or is there a way to mitigate a checksum error on non-redundant zpool?

It's just like the difference between non-parity, parity, and ECC memory.
Most filesystems don't have checksums (non-parity), so they don't even
know when they're returning corrupt data.  ZFS without any replication
can detect errors, but can't fix them (like parity memory).  ZFS with
mirroring or RAID-Z can both detect and correct (like ECC memory).

Note: even in a single-device pool, ZFS metadata is replicated via
ditto blocks at two or three different places on the device, so that
a localized media failure can be both detected and corrected.
If you have two or more devices, even without any mirroring
or RAID-Z, ZFS metadata is mirrored (again via ditto blocks)
across those devices.
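
As an aside not in the original mail: on a single-device pool you can extend
that ditto-block style redundancy to user data as well, by raising the copies
property (it only affects data written after the change), e.g.:

# zfs set copies=2 tank/fs    # store two copies of each subsequently written data block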

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool file corruption

2008-09-25 Thread Jeff Bonwick
It's almost certainly the SIL3114 controller.
Google SIL3114 data corruption -- it's nasty.

Jeff

On Thu, Sep 25, 2008 at 07:50:01AM +0200, Mikael Karlsson wrote:
 I have a strange problem involving changes in large files on a mirrored 
 zpool in OpenSolaris snv_96.
 We use it as storage in a VMware ESXi lab environment.  All virtual disk 
 files get corrupted when changes are made within the files (when running the 
 machine, that is).
 
 The sad thing is that I've created about ~200Gb of random data in 
 large files and
 even modified those files without any problem (using dd with skip and 
 conv=notrunc options).
 I've copied the files within the pool and over the network on all 
 network interfaces
 on the machine - without problems.
 
 It's just those .vmdk files that gets corrupted.
 
 The hardware is an Opteron desktop machine with a SIL3114 sata 
 interface. Personally I have exactly
 the same interface at home with the same setup without problem. Only the 
 other hardware differs (disks and so on).
 
 The disks are WD7500AACS, the ones with variable rotation speed 
 (5400-7200 rpm).  Could it be the disks?  Could it be the disk controller or 
 the rest of the hardware?  I should mention that the controller has been 
 flashed with a non-RAID BIOS.
 
 I could provide more information if needed!  Does anyone have any ideas or 
 suggestions?
 
 
 Some output:
 
 bash-3.00# zpool status -vx
   pool: testing
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: scrub completed with 1 errors on Wed Sep 24 16:59:13 2008
 config:
 
 NAME        STATE     READ WRITE CKSUM
 testing     ONLINE       0     0    16
   mirror    ONLINE       0     0    16
     c0d1    ONLINE       0     0    51
     c1d1    ONLINE       0     0    54
 
 errors: Permanent errors have been detected in the following files:
 
 /testing/ZFS-problem/ZFS-problem-flat.vmdk
 
 
 Regards
 
 Mikael
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remove log device?

2008-07-13 Thread Jeff Bonwick
You are correct, and it is indeed annoying.  I hope to have this
fixed by the end of the month.

Jeff

On Sun, Jul 13, 2008 at 10:16:55PM -0500, Mike Gerdts wrote:
 It seems as though there is no way to remove a log device once it is
 added.  Is this correct?
 
 Assuming this is correct, is there any reason that adding the ability
 to remove the log device would be particularly tricky?
 
 -- 
 Mike Gerdts
 http://mgerdts.blogspot.com/
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] scrub never finishes

2008-07-13 Thread Jeff Bonwick
ZFS co-inventor Matt Ahrens recently fixed this:

6343667 scrub/resilver has to start over when a snapshot is taken

Trust me when I tell you that solving this correctly was much harder
than you might expect.  Thanks again, Matt.

Jeff

On Sun, Jul 13, 2008 at 07:08:48PM -0700, Anil Jangity wrote:
 Oh, my hunch was right. Yup, I do have an hourly snapshot going. I'll 
 take it out and see.
 
 Thanks!
 
 
 Bob Friesenhahn wrote:
  On Sun, 13 Jul 2008, Anil Jangity wrote:
 

  On one of the pools, I started a scrub. It never finishes. At one time,
  I saw it go up to like 70% and then a little bit later I ran the pool
  status, it went back to 5% and started again.
 
  What is going on? Here is the pool layout:
  
 
  Initiating a snapshot stops the scrub.  I don't know if the scrub is 
  restarted at 0%, or simply aborted.  Are you taking snapshots during 
  the scrub?
 
  Bob
 
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] scrub failing to initialise

2008-07-11 Thread Jeff Bonwick
If the cabling outage was transient, the disk driver would simply retry
until they came back.  If it's a hotplug-capable bus and the disks were
flagged as missing, ZFS would by default wait until the disks came back
(see zpool get failmode pool), and complete the I/O then.  There would
be no missing disk writes, hence nothing to resilver.

Jeff

On Mon, Jul 07, 2008 at 06:55:02PM +0200, Justin Vassallo wrote:
 Hi,
 
  
 
 I've got a zpool made up of 2 mirrored vdevs.  For one moment I had a cabling
 problem and lost all disks... I reconnected and onlined the disks.  No
 resilvering kicked in, so I tried to force a scrub, but nothing's happening.
 I issue the command and it's as if I never did.
 
  
 
 Any suggestions?
 
  
 
 Thanks
 
 justin
 



 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] is it possible to add a mirror device later?

2008-07-06 Thread Jeff Bonwick
I would just swap the physical locations of the drives, so that the
second half of the mirror is in the right location to be bootable.
ZFS won't mind -- it tracks the disks by content, not by pathname.
Note that SATA is not hotplug-happy, so you're probably best off
doing this while the box is powered off.  Upon reboot, ZFS should
figure out what happened, update the device paths, and... that's it.

Jeff

On Sun, Jul 06, 2008 at 08:47:25AM +0200, Tommaso Boccali wrote:
 
  As Edna and Robert mentioned, zpool attach will add the mirror.
  But note that the X4500 has only two possible boot devices:
  c5t0d0 and c5t4d0.  This is a BIOS limitation.  So you will want
  to mirror with c5t4d0 and configure the disks for boot.  See the
  docs on ZFS boot for details on how to configure the boot sectors
  and grub.
  -- richard
 
 
 uhm, bad.
 
 I did not know this, so now the root is
 bash-3.2# zpool status rpool
   pool: rpool
  state: ONLINE
  scrub: resilver completed after 0h8m with 0 errors on Wed Jul  2 16:09:14 
 2008
 config:
 
 NAME  STATE READ WRITE CKSUM
 rpool ONLINE   0 0 0
   mirror  ONLINE   0 0 0
 c5t0d0s0  ONLINE   0 0 0
 c1t7d0ONLINE   0 0 0
 spares
   c0t7d0  AVAIL
   c1t6d0  AVAIL
 
 
 while c5t4d0 belongs to a raiz pool:
 
 ...
   raidz1ONLINE   0 0 0
 c0t4d0  ONLINE   0 0 0
 c1t4d0  ONLINE   0 0 0
 c5t4d0  ONLINE   0 0 0
 c6t7d0  ONLINE   0 0 0
 c5t5d0  ONLINE   0 0 0
 c5t6d0  ONLINE   0 0 0
 c5t7d0  ONLINE   0 0 0
 c1t5d0  ONLINE   0 0 0
 ...
 
 is it possible to restore the good behavior?
 something like
 - detach c1t7d0 from rpool
 - detach c5t4d0 from the other pool (the pool still survives since it is 
 raidz)
 - reattach in reverse order? (and so reform mirror and raidz?)
 
 thanks a lot again
 
 tommaso
 
 
 
 
 -- 
 Tommaso Boccali
 INFN Pisa
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] confusion and frustration with zpool

2008-07-06 Thread Jeff Bonwick
As a first step, 'fmdump -ev' should indicate why it's complaining
about the mirror.

Jeff

On Sun, Jul 06, 2008 at 07:55:22AM -0700, Pete Hartman wrote:
 I'm doing another scrub after clearing insufficient replicas only to find 
 that I'm back to the report of insufficient replicas, which basically leads 
 me to expect this scrub (due to complete in about 5 hours from now) won't 
 have any benefit either.
 
 -bash-3.2#  zpool status local
   pool: local
  state: FAULTED
  scrub: scrub in progress for 0h32m, 9.51% done, 5h11m to go
 config:
 
 NAME  STATE READ WRITE CKSUM
 local FAULTED  0 0 0  insufficient replicas
   mirror  ONLINE   0 0 0
 c6d1p0ONLINE   0 0 0
 c0t0d0s3  ONLINE   0 0 0
   mirror  ONLINE   0 0 0
 c6d0p0ONLINE   0 0 0
 c0t0d0s4  ONLINE   0 0 0
   mirror  UNAVAIL  0 0 0  corrupted data
 c8t0d0p0  ONLINE   0 0 0
 c0t0d0s5  ONLINE   0 0 0
 
 errors: No known data errors
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bug id 6343667

2008-07-05 Thread Jeff Bonwick
FYI, we are literally just days from having this fixed.

Matt: after putback you really should blog about this one --
both to let people know that this long-standing bug has been
fixed, and to describe your approach to it.

It's a surprisingly tricky and interesting problem.

Jeff

On Sat, Jul 05, 2008 at 01:20:11PM -0700, Ross wrote:
 If it ever does get released I'd love to hear about it.  That bug, and the 
 fact it appears to have been outstanding for three years, was one of the 
 major reasons behind us not purchasing a bunch of x4500's.
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Changing GUID

2008-07-02 Thread Jeff Bonwick
 How difficult would it be to write some code to change the GUID of a pool?

As a recreational hack, not hard at all.  But I cannot recommend it
in good conscience, because if the pool contains more than one disk,
the GUID change cannot possibly be atomic.  If you were to crash or
lose power in the middle of the operation, your data would be gone.

What problem are you trying to solve?

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [caiman-discuss] swap dump on ZFS volume

2008-07-01 Thread Jeff Bonwick
 To be honest, it is not quite clear to me, how we might utilize
 dumpadm(1M) to help us to calculate/recommend size of dump device.
 Could you please elaborate more on this ?

dumpadm(1M) -c specifies the dump content, which can be kernel, kernel plus
current process, or all memory.  If the dump content is 'all', the dump space
needs to be as large as physical memory.  If it's just 'kernel', it can be
some fraction of that.
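
For example -- just a sketch, and the zvol path below is the stock ZFS-root
layout, which may differ on your system:

# dumpadm -c kernel
# dumpadm -c all
# dumpadm -d /dev/zvol/dsk/rpool/dump

With -c kernel a dump device that's a fraction of physical memory is normally
enough; with -c all it has to be at least as large as RAM.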

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [caiman-discuss] swap dump on ZFS volume

2008-07-01 Thread Jeff Bonwick
 The problem is that size-capping is the only control we have over
 thrashing right now.

It's not just thrashing, it's also any application that leaks memory.
Without a cap, the broken application would continue plowing through
memory until it had consumed every free block in the storage pool.

What we really want is dynamic allocation with lower and upper bounds
to ensure that there's always enough swap space, and that a reasonable
upper limit isn't exceeded.  As fortune would have it, that's exactly
what we get with quotas and reservations on zvol-based swap today.

If you prefer uncapped behavior, no problem -- unset the reservation
and grow the swap zvol to 16EB.
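
As a sketch, assuming the stock swap zvol name (sizes are illustrative):

# zfs get volsize,refreservation rpool/swap
# zfs set refreservation=none rpool/swap
# zfs set volsize=16G rpool/swap

You may need to swap -d and then swap -a the device again before the VM
system sees the new size.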

(Ultimately it would be cleaner to express this more directly, rather
than via the nominal size of an emulated volume.  The VM 2.0 project
will address that, along with many other long-standing annoyances.)

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [caiman-discuss] swap dump on ZFS volume

2008-06-30 Thread Jeff Bonwick
 Neither swap or dump are mandatory for running Solaris.

Dump is mandatory in the sense that losing crash dumps is criminal.

Swap is more complex.  It's certainly not mandatory.  Not so long ago,
swap was typically larger than physical memory.  But in recent years,
we've essentially moved to a world in which paging is considered a bug.
Swap devices are often only a fraction of physical memory size now,
which raises the question of why we even bother.  On my desktop, which
has 16GB of memory, the default OpenSolaris swap partition is 2GB.
That's just stupid.  Unless swap space significantly expands the
amount of addressable virtual memory, there's no reason to have it.

There have been a number of good suggestions here:

(1) The right way to size the dump device is to let dumpadm(1M) do it
based on the dump content type.

(2) In a virtualized environment, a better way to get a crash dump
would be to snapshot the VM.  This would require a little bit
of host/guest cooperation, in that the installer (or dumpadm)
would have to know that it's operating in a VM, and the kernel
would need some way to notify the VM that it just panicked.
Both of these ought to be doable.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool with RAID-5 from intelligent storage arrays

2008-06-30 Thread Jeff Bonwick
Using ZFS to mirror two hardware RAID-5 LUNs is actually quite nice.
Because the data is mirrored at the ZFS level, you get all the benefits
of self-healing.  Moreover, you can survive a great variety of hardware
failures: three or more disks can die (one in the first array, two or
more in the second), failure of a cable, or failure of an entire array.
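
For example, with one LUN from each array (device names are made up):

# zpool create tank mirror c2t0d0 c3t0d0

Every block then has a copy on each array, so a checksum failure on one side
can be repaired from the other.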

Jeff

On Sat, Jun 14, 2008 at 08:09:49AM -0700, zfsmonk wrote:
 Mentioned on 
 http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide is 
 the following:
 ZFS works well with storage based protected LUNs (RAID-5 or mirrored LUNs 
 from intelligent storage arrays). However, ZFS cannot heal corrupted blocks 
 that are detected by ZFS checksums.
 
 based upon that, if we have LUNs already in RAID5 being served from 
 intelligent storage arrays, is it any benefit to create the zpool in a mirror 
 if zfs can't heal any corrupted blocks? Or would we just be wasting disk 
 space?
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Deferred Frees

2008-06-30 Thread Jeff Bonwick
When a block is freed as part of transaction group N, it can be reused
in transaction group N+1.  There's at most a one-txg (few-second) delay.

Jeff

On Mon, Jun 16, 2008 at 01:02:53PM -0400, Torrey McMahon wrote:
 I'm doing some simple testing of ZFS block reuse and was wondering when 
 deferred frees kick in. Is it on some sort of timer to ensure data 
 consistency? Does an other routine call it? Would something as simple as 
 sync(1M) get the free block list written out so future allocations could 
 use the space?
 
 ... or am I way off in the weeds? :)
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs mirror broken?

2008-06-20 Thread Jeff Bonwick
If you say 'zpool online pool disk' that should tell ZFS that
the disk is healthy again and automatically kick off a resilver.

Of course, that should have happened automatically.  What version
of ZFS / Solaris are you running?

Jeff

On Fri, Jun 20, 2008 at 06:01:25PM +0200, Justin Vassallo wrote:
 Hi,
 
  
 
 I have a zpool made of 2 vdev mirrors, with disks connected via USB hub.
 
  
 
 While one vdev was resilvering at 22% (HD replacement), the original disk
 went away (seems the USB hub is the culprit). I turned the disk off and back
 on. The status of the disk came back to ONLINE, but there is no resilvering
 happening. Disks are cool and idle.
 
  
 
 An clues what could be happening here? Should i plug out / in the new disk
 again?
 
  
 
 I can't check what status the data is in, because it was being  used by a
 non-global zone which is failing to start, but that's another porblem:
 
  
 
 # zoneadm -z ZONE boot
 
 could not verify fs /data: could not access /tank/data: No such file or
 directory
 
 zoneadm: zone ZONE failed to verify
 
  
 
  
 
 justin
 



 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cannot delete errored file

2008-06-10 Thread Jeff Bonwick
That's odd -- the only way the 'rm' should fail is if it can't
read the znode for that file.  The znode is metadata, and is
therefore stored in two distinct places using ditto blocks.
So even if you had one unlucky copy that was damaged on two
of your disks, you should still have another copy elsewhere.

Assuming you weren't so shockingly unlucky, the only way to
get a corrupted znode that I know of is flaky memory, such that
the znode is checksummed, then the DRAM flips a bit, then you
write the znode to disk.  The fact that you've seen so many
checksum errors makes me suspect hardware all the more.

Can you send me the output of fmdump -ev and fmdump -eV ?
There should be some useful crumbs in there...

Jeff

On Tue, Jun 03, 2008 at 04:27:21AM -0700, Ben Middleton wrote:
 Hi,
 
 I can't seem to delete a file in my zpool that has permanent errors:
 
 zpool status -vx
   pool: rpool
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: scrub completed after 2h10m with 1 errors on Tue Jun  3 11:36:49 2008
 config:
 
 NAMESTATE READ WRITE CKSUM
 rpool   ONLINE   0 0 0
   raidz1ONLINE   0 0 0
 c0t0d0  ONLINE   0 0 0
 c0t1d0  ONLINE   0 0 0
 c0t2d0  ONLINE   0 0 0
 
 errors: Permanent errors have been detected in the following files:
 
 /export/duke/test/Acoustic/3466/88832/09 - Check.mp3
 
 
 rm /export/duke/test/Acoustic/3466/88832/09 - Check.mp3
 
 rm: cannot remove `/export/duke/test/Acoustic/3466/88832/09 - Check.mp3': I/O 
 error
 
 Each time I try to do anything to the file, the checksum error count goes up 
 on the pool.
 
 I also tried a mv and a cp over the top - but same I/O error.
 
 I performed a zpool scrub rpool followed by a zpool clear rpool - but 
 still get the same error. Any ideas?
 
 PS - I'm running snv_86, and use the sata driver on an intel x86 architecture.
 
 B
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [caiman-discuss] disk names?

2008-06-04 Thread Jeff Bonwick
I agree with that.  format(1M) and cfgadm(1M) are, ah, not the most
user-friendly tools.  It would be really nice to have 'zpool disks'
go out and taste all the drives to see which ones are available.

We already have most of the code to do it.  'zpool import' already
contains the taste-all-disks-and-slices logic, and 'zpool add'
already contains the logic to determine whether a device is in use.
Looks like all we're really missing is a call to printf()...

Is there an RFE for this?  If not, I'll file one.  I like the idea.

Jeff

On Wed, Jun 04, 2008 at 10:55:18AM -0500, Bob Friesenhahn wrote:
 On Tue, 3 Jun 2008, Dave Miner wrote:
 
  Putting into the zpool command would feel odd to me, but I agree that
  there may be a useful utility here.
 
 There is value to putting this functionality in zpool for the same 
 reason that it was useful to put 'iostat' and other duplicate 
 functionality in zpool.  For example, zpool can skip disks which are 
 already currently in use, or it can recommend whole disks (rather than 
 partitions) if none of the logical disk partitions are currently in 
 use.
 
 The zfs commands are currently at least an order of magnitude easier 
 to comprehend and use than the legacy commands related to storage 
 devices.  It would be nice if the zfs commands will continue to 
 simplify what is now quite obtuse.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with raidz

2008-05-30 Thread Jeff Bonwick
Very cool!  Just one comment.  You said:

 We'll try compression level #9.

gzip-9 is *really* CPU-intensive, often for little gain over gzip-1.
As in, it can take 100 times longer and yield just a few percent gain.
The CPU cost will limit write bandwidth to a few MB/sec per core.

I'd suggest that you begin by doing a simple experiment -- create a
filesystem at each compression level, copy representative identical
data to each one, and compare space usage.  My guess is that you'll
find the knee in the cost/benefit curve well below gzip-9.  Also,
if you're storing jpegs or video files, those are already compressed,
in which case the benefit will be zero even at gzip-9.
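
A rough sketch of that experiment (dataset names and the sample path are
placeholders; use data that's representative of your workload):

# zfs create -o compression=gzip-1 tank/gz1
# zfs create -o compression=gzip-6 tank/gz6
# zfs create -o compression=gzip-9 tank/gz9
# for fs in gz1 gz6 gz9; do cp -r /path/to/sample/data /tank/$fs; done
# zfs get -r compressratio tank

Comparing the compressratio values (and the elapsed copy times) should make
the knee in the curve obvious.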

That said, the other consideration is how you're using the storage.
If the write rate is modest and disk space is at a premium, the CPU
cost may simply not matter.  And note that only writes are affected:
when reading data back, gzip is equally fast regardless of level.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recovering data from a dettach mirrored vdev

2008-05-07 Thread Jeff Bonwick
Yes, I think that would be useful.  Something like 'zpool revive'
or 'zpool undead'.  It would not be completely general-purpose --
in a pool with multiple mirror devices, it could only work if
all replicas were detached in the same txg -- but for the simple
case of a single top-level mirror vdev, or a clean 'zpool split',
it's actually pretty straightforward.

Jeff

On Tue, May 06, 2008 at 11:16:25AM +0100, Darren J Moffat wrote:
 Great tool, any chance we can have it integrated into zpool(1M) so that 
 it can find and fixup on import detached vdevs as new pools ?
 
 I'd think it would be reasonable to extend the meaning of
 'zpool import -D' to list detached vdevs as well as destroyed pools.
 
 --
 Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recovering data from a dettach mirrored vdev

2008-05-04 Thread Jeff Bonwick
Oh, you're right!  Well, that will simplify things!  All we have to do
is convince a few bits of code to ignore ub_txg == 0.  I'll try a
couple of things and get back to you in a few hours...

Jeff

On Fri, May 02, 2008 at 03:31:52AM -0700, Benjamin Brumaire wrote:
 Hi,
 
 while diving deeply in zfs in order to recover data I found that every 
 uberblock in label0 does have the same ub_rootbp and a zeroed ub_txg. Does it 
 means only ub_txg was touch while detaching?  
 
 Hoping  it is the case, I modified ub_txg from one uberblock to match the tgx 
 from the label and now I try to  calculate the new SHA256 checksum but I 
 failed. Can someone explain what I did wrong? And of course how to do it 
 correctly?
 
 bbr
 
 
 The example is from a valid uberblock which belongs an other pool.
 
 Dumping the active uberblock in Label 0:
 
 # dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=1024 | od -x 
 1024+0 records in
 1024+0 records out
 000 b10c 00ba   0009   
 020 8bf2    8eef f6db c46f 4dcc
 040 bba8 481a   0001   
 060 05e6 0003   0001   
 100 05e6 005b   0001   
 120 44e9 00b2   0001  0703 800b
 140        
 160     8bf2   
 200 0018    a981 2f65 0008 
 220 e734 adf2 037a  cedc d398 c063 
 240 da03 8a6e 26fc 001c    
 260        
 *
 0001720     7a11 b10c da7a 0210
 0001740 3836 20fb e2a7 a737 a947 feed 43c5 c045
 0001760 82a8 133d 0ba7 9ce7 e5d5 64e2 2474 3b03
 0002000
 
 Checksum is at pos 01740 01760
 
 I try to calculate it assuming only uberblock is relevant. 
 #dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=168 | digest -a sha256
 168+0 records in
 168+0 records out
 710306650facf818e824db5621be394f3b3fe934107bdfc861bbc82cb9e1bbf3
 
 Helas not matching  :-(
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] lost zpool when server restarted.

2008-05-04 Thread Jeff Bonwick
It's OK that you're missing labels 2 and 3 -- there are four copies
precisely so that you can afford to lose a few.  Labels 2 and 3
are at the end of the disk.  The fact that only they are missing
makes me wonder if someone resized the LUNs.  Growing them would
be OK, but shrinking them would indeed cause the pool to fail to
open (since part of it was amputated).
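
You can check which of the four labels are still readable with zdb -- the
device path below is just the one from your config, and the exact path may
differ with powerpath:

# zdb -l /dev/dsk/emcpower0a

It dumps the contents of labels 0-3 for that device and calls out any label
it can't unpack.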

There ought to be more helpful diagnostics in the FMA error log.
After a failed attempt to import, type this:

# fmdump -ev

and let me know what it says.

Jeff

On Tue, Apr 29, 2008 at 03:31:53PM -0400, Krzys wrote:
 
 
 
 I have a problem on one of my systems with zfs. I used to have zpool created 
 with 3 luns on SAN. I did not have to put any raid or anything on it since it 
 was already using raid on SAN. Anyway server rebooted and I cannot zee my 
 pools. 
 When I do try to import it it does fail. I am using EMC Clarion as SAN and 
 powerpath
 # zpool list
 no pools available
 # zpool import -f
   pool: mypool
   id: 4148251638983938048
 state: FAULTED
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
   devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-3C
 config:
   mypool UNAVAIL insufficient replicas
   emcpower0a UNAVAIL cannot open
   emcpower2a UNAVAIL cannot open
   emcpower3a ONLINE
 
 I think I am able to see all the luns and I should be able to access them on 
 my 
 sun box.
 # powermt display dev=all
 Pseudo name=emcpower0a
 CLARiiON ID=APM00070202835 [NRHAPP02]
 Logical device ID=6006016045201A001264FB20990FDC11 [LUN 13]
 state=alive; policy=CLAROpt; priority=0; queued-IOs=0
 Owner: default=SP B, current=SP B
 ==
  Host --- - Stor - -- I/O Path - -- Stats ---
 ### HW Path I/O Paths Interf. Mode State Q-IOs Errors
 ==
 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 c2t5006016041E035A4d0s0 SP A4 active 
 alive 0 0
 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 c2t5006016941E035A4d0s0 SP B5 active 
 alive 0 0
 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 c3t5006016141E035A4d0s0 SP A5 
 active alive 0 0
 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 c3t5006016841E035A4d0s0 SP B4 
 active alive 0 0
 
 
 Pseudo name=emcpower1a
 CLARiiON ID=APM00070202835 [NRHAPP02]
 Logical device ID=6006016045201A004C1388343C10DC11 [LUN 14]
 state=alive; policy=CLAROpt; priority=0; queued-IOs=0
 Owner: default=SP B, current=SP B
 ==
  Host --- - Stor - -- I/O Path - -- Stats ---
 ### HW Path I/O Paths Interf. Mode State Q-IOs Errors
 ==
 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 c2t5006016041E035A4d1s0 SP A4 active 
 alive 0 0
 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 c2t5006016941E035A4d1s0 SP B5 active 
 alive 0 0
 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 c3t5006016141E035A4d1s0 SP A5 
 active alive 0 0
 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 c3t5006016841E035A4d1s0 SP B4 
 active alive 0 0
 
 
 Pseudo name=emcpower3a
 CLARiiON ID=APM00070202835 [NRHAPP02]
 Logical device ID=6006016045201A00A82C68514E86DC11 [LUN 7]
 state=alive; policy=CLAROpt; priority=0; queued-IOs=0
 Owner: default=SP B, current=SP B
 ==
  Host --- - Stor - -- I/O Path - -- Stats ---
 ### HW Path I/O Paths Interf. Mode State Q-IOs Errors
 ==
 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 c2t5006016041E035A4d3s0 SP A4 active 
 alive 0 0
 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL 
 PROTECTED],0 c2t5006016941E035A4d3s0 SP B5 active 
 alive 0 0
 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 c3t5006016141E035A4d3s0 SP A5 
 active alive 0 0
 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL 
 PROTECTED]/[EMAIL PROTECTED],0 c3t5006016841E035A4d3s0 SP B4 
 active alive 0 0
 
 
 Pseudo name=emcpower2a
 CLARiiON ID=APM00070202835 [NRHAPP02]
 Logical device ID=600601604B141B00C2F6DB2AC349DC11 [LUN 24]
 state=alive; policy=CLAROpt; priority=0; queued-IOs=0
 Owner: default=SP B, current=SP B
 

Re: [zfs-discuss] lost zpool when server restarted.

2008-05-04 Thread Jeff Bonwick
 Looking at the txg numbers, it's clear that labels on to devices that
 are unavailable now may be stale:

Actually, they look OK.  The txg values in the label indicate the
last txg in which the pool configuration changed for devices in that
top-level vdev (e.g. mirror or raid-z group), not the last txg synced.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recovering data from a dettach mirrored vdev

2008-05-04 Thread Jeff Bonwick
OK, here you go.  I've successfully recovered a pool from a detached
device using the attached binary.  You can verify its integrity
against the following MD5 hash:

# md5sum labelfix
ab4f33d99fdb48d9d20ee62b49f11e20  labelfix

It takes just one argument -- the disk to repair:

# ./labelfix /dev/rdsk/c0d1s4

If all goes according to plan, your old pool should be importable.
If you do a zpool status -v, it will complain that the old mirrors
are no longer there.  You can clean that up by detaching them:

# zpool detach mypool <guid>

where <guid> is the long integer that zpool status -v reports
as the name of the missing device.

Good luck, and please let us know how it goes!

Jeff

On Sat, May 03, 2008 at 10:48:34PM -0700, Jeff Bonwick wrote:
 Oh, you're right!  Well, that will simplify things!  All we have to do
 is convince a few bits of code to ignore ub_txg == 0.  I'll try a
 couple of things and get back to you in a few hours...
 
 Jeff
 
 On Fri, May 02, 2008 at 03:31:52AM -0700, Benjamin Brumaire wrote:
  Hi,
  
  while diving deeply in zfs in order to recover data I found that every 
  uberblock in label0 does have the same ub_rootbp and a zeroed ub_txg. Does 
  it means only ub_txg was touch while detaching?  
  
  Hoping  it is the case, I modified ub_txg from one uberblock to match the 
  tgx from the label and now I try to  calculate the new SHA256 checksum but 
  I failed. Can someone explain what I did wrong? And of course how to do it 
  correctly?
  
  bbr
  
  
  The example is from a valid uberblock which belongs an other pool.
  
  Dumping the active uberblock in Label 0:
  
  # dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=1024 | od -x 
  1024+0 records in
  1024+0 records out
  000 b10c 00ba   0009   
  020 8bf2    8eef f6db c46f 4dcc
  040 bba8 481a   0001   
  060 05e6 0003   0001   
  100 05e6 005b   0001   
  120 44e9 00b2   0001  0703 800b
  140        
  160     8bf2   
  200 0018    a981 2f65 0008 
  220 e734 adf2 037a  cedc d398 c063 
  240 da03 8a6e 26fc 001c    
  260        
  *
  0001720     7a11 b10c da7a 0210
  0001740 3836 20fb e2a7 a737 a947 feed 43c5 c045
  0001760 82a8 133d 0ba7 9ce7 e5d5 64e2 2474 3b03
  0002000
  
  Checksum is at pos 01740 01760
  
  I try to calculate it assuming only uberblock is relevant. 
  #dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=168 | digest -a sha256
  168+0 records in
  168+0 records out
  710306650facf818e824db5621be394f3b3fe934107bdfc861bbc82cb9e1bbf3
  
  Helas not matching  :-(
   
   
  This message posted from opensolaris.org
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


labelfix
Description: Binary data
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recovering data from a dettach mirrored vdev

2008-05-04 Thread Jeff Bonwick
Oh, and here's the source code, for the curious:

#include <devid.h>
#include <dirent.h>
#include <errno.h>
#include <libintl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stddef.h>

#include <sys/vdev_impl.h>

/*
 * Write a label block with a ZBT checksum.
 */
static void
label_write(int fd, uint64_t offset, uint64_t size, void *buf)
{
        zio_block_tail_t *zbt, zbt_orig;
        zio_cksum_t zc;

        zbt = (zio_block_tail_t *)((char *)buf + size) - 1;
        zbt_orig = *zbt;

        ZIO_SET_CHECKSUM(&zbt->zbt_cksum, offset, 0, 0, 0);

        zio_checksum(ZIO_CHECKSUM_LABEL, &zc, buf, size);

        VERIFY(pwrite64(fd, buf, size, offset) == size);

        *zbt = zbt_orig;
}

int
main(int argc, char **argv)
{
        int fd;
        vdev_label_t vl;
        nvlist_t *config;
        uberblock_t *ub = (uberblock_t *)vl.vl_uberblock;
        uint64_t txg;
        char *buf;
        size_t buflen;

        VERIFY(argc == 2);
        VERIFY((fd = open(argv[1], O_RDWR)) != -1);
        VERIFY(pread64(fd, &vl, sizeof (vdev_label_t), 0) ==
            sizeof (vdev_label_t));
        VERIFY(nvlist_unpack(vl.vl_vdev_phys.vp_nvlist,
            sizeof (vl.vl_vdev_phys.vp_nvlist), &config, 0) == 0);
        VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);
        VERIFY(txg == 0);
        VERIFY(ub->ub_txg == 0);
        VERIFY(ub->ub_rootbp.blk_birth != 0);

        txg = ub->ub_rootbp.blk_birth;
        ub->ub_txg = txg;

        VERIFY(nvlist_remove_all(config, ZPOOL_CONFIG_POOL_TXG) == 0);
        VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_TXG, txg) == 0);
        buf = vl.vl_vdev_phys.vp_nvlist;
        buflen = sizeof (vl.vl_vdev_phys.vp_nvlist);
        VERIFY(nvlist_pack(config, &buf, &buflen, NV_ENCODE_XDR, 0) == 0);

        label_write(fd, offsetof(vdev_label_t, vl_uberblock),
            1ULL << UBERBLOCK_SHIFT, ub);

        label_write(fd, offsetof(vdev_label_t, vl_vdev_phys),
            VDEV_PHYS_SIZE, &vl.vl_vdev_phys);

        fsync(fd);

        return (0);
}

Jeff

On Sun, May 04, 2008 at 01:21:27AM -0700, Jeff Bonwick wrote:
 OK, here you go.  I've successfully recovered a pool from a detached
 device using the attached binary.  You can verify its integrity
 against the following MD5 hash:
 
 # md5sum labelfix
 ab4f33d99fdb48d9d20ee62b49f11e20  labelfix
 
 It takes just one argument -- the disk to repair:
 
 # ./labelfix /dev/rdsk/c0d1s4
 
 If all goes according to plan, your old pool should be importable.
 If you do a zpool status -v, it will complain that the old mirrors
 are no longer there.  You can clean that up by detaching them:
 
 # zpool detach mypool guid
 
 where guid is the long integer that zpool status -v reports
 as the name of the missing device.
 
 Good luck, and please let us know how it goes!
 
 Jeff
 
 On Sat, May 03, 2008 at 10:48:34PM -0700, Jeff Bonwick wrote:
  Oh, you're right!  Well, that will simplify things!  All we have to do
  is convince a few bits of code to ignore ub_txg == 0.  I'll try a
  couple of things and get back to you in a few hours...
  
  Jeff
  
  On Fri, May 02, 2008 at 03:31:52AM -0700, Benjamin Brumaire wrote:
   Hi,
   
   while diving deeply in zfs in order to recover data I found that every 
   uberblock in label0 does have the same ub_rootbp and a zeroed ub_txg. 
   Does it means only ub_txg was touch while detaching?  
   
   Hoping  it is the case, I modified ub_txg from one uberblock to match the 
   tgx from the label and now I try to  calculate the new SHA256 checksum 
   but I failed. Can someone explain what I did wrong? And of course how to 
   do it correctly?
   
   bbr
   
   
   The example is from a valid uberblock which belongs an other pool.
   
   Dumping the active uberblock in Label 0:
   
   # dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=1024 | od -x 
   1024+0 records in
   1024+0 records out
   000 b10c 00ba   0009   
   020 8bf2    8eef f6db c46f 4dcc
   040 bba8 481a   0001   
   060 05e6 0003   0001   
   100 05e6 005b   0001   
   120 44e9 00b2   0001  0703 800b
   140        
   160     8bf2   
   200 0018    a981 2f65 0008 
   220 e734 adf2 037a  cedc d398 c063 
   240 da03 8a6e 26fc 001c    
   260        
   *
   0001720     7a11 b10c da7a 0210
   0001740 3836 20fb e2a7 a737 a947 feed 43c5 c045
   0001760 82a8 133d 0ba7 9ce7 e5d5 64e2 2474 3b03
   0002000
   
   Checksum is at pos 01740 01760
   
   I try to calculate it assuming only uberblock is relevant. 
   #dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=168 | digest -a sha256
   168+0 records in
   168+0 records out
   710306650facf818e824db5621be394f3b3fe934107bdfc861bbc82cb9e1bbf3

Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools

2008-04-30 Thread Jeff Bonwick
Indeed, things should be simpler with fewer (generally one) pool.

That said, I suspect I know the reason for the particular problem
you're seeing: we currently do a bit too much vdev-level caching.
Each vdev can have up to 10MB of cache.  With 132 pools, even if
each pool is just a single iSCSI device, that's 1.32GB of cache.

We need to fix this, obviously.  In the interim, you might try
setting zfs_vdev_cache_size to some smaller value, like 1MB.
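
A sketch of how to do that (the value is in bytes; the tunable is a 32-bit
int in the current source, hence the 32-bit write):

# echo zfs_vdev_cache_size/W0t1048576 | mdb -kw

or, to make it stick across reboots, add to /etc/system:

set zfs:zfs_vdev_cache_size = 1048576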

Still, I'm curious -- why lots of pools?  Administration would
be simpler with a single pool containing many filesystems.

Jeff

On Wed, Apr 30, 2008 at 11:48:07AM -0700, Bill Moore wrote:
 A silly question:  Why are you using 132 ZFS pools as opposed to a
 single ZFS pool with 132 ZFS filesystems?
 
 
 --Bill
 
 On Wed, Apr 30, 2008 at 01:53:32PM -0400, Chris Siebenmann wrote:
   I have a test system with 132 (small) ZFS pools[*], as part of our
  work to validate a new ZFS-based fileserver environment. In testing,
  it appears that we can produce situations that will run the kernel out
  of memory, or at least out of some resource such that things start
  complaining 'bash: fork: Resource temporarily unavailable'. Sometimes
  the system locks up solid.
  
   I've found at least two situations that reliably do this:
  - trying to 'zpool scrub' each pool in sequence (waiting for each scrub
to complete before starting the next one).
  - starting simultaneous sequential read IO from all pools from a NFS client.
(trying to do the same IO from the server basically kills the server
entirely.)
  
   If I aggregate the same disk space into 12 pools instead of 132, the
  same IO load does not kill the system.
  
   The ZFS machine is an X2100 M2 with 2GB of physical memory and 1GB
  of swap, running 64-bit Solaris 10 U4 with an almost current set of
  patches; it gets the storage from another machine via ISCSI. The pools
  are non-redundant, with each vdev being a whole ISCSI LUN.
  
   Is this a known issue (or issues)? If this isn't a known issue, does
  anyone have pointers to good tools to trace down what might be happening
  and where memory is disappearing and so on? Does the system plain need
  more memory for this number of pools and if so, does anyone know how
  much?
  
   Thanks in advance.
  
  (I was pointed to mdb -k's '::kmastat' by some people on the OpenSolaris
  IRC channel but I haven't spotted anything particularly enlightening in
  its output, and I can't run it once the system has gone over the edge.)
  
  - cks
  [*: we have an outstanding uncertainty over how many ZFS pools a
  single system can sensibly support, so testing something larger
  than we'd use in production seemed sensible.]
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recovering data from a dettach mirrored vdev

2008-04-29 Thread Jeff Bonwick
If your entire pool consisted of a single mirror of two disks, A and B,
and you detached B at some point in the past, you *should* be able to
recover the pool as it existed when you detached B.  However, I just
tried that experiment on a test pool and it didn't work.  I will
investigate further and get back to you.  I suspect it's perfectly
doable, just currently disallowed due to some sort of error check
that's a little more conservative than necessary.  Keep that disk!

Jeff

On Mon, Apr 28, 2008 at 10:33:32PM -0700, Benjamin Brumaire wrote:
 Hi,
 
 my system (solaris b77) was physically destroyed and i loosed data saved in a 
 zpool mirror. The only thing left is a dettached vdev from the pool. I'm 
 aware that uberblock is gone and that i can't import the pool. But i still 
 hope their is a way or a tool (like tct http://www.porcupine.org/forensics/) 
 i can go too recover at least partially some data)
 
 thanks in advance for any hints.
 
 bbr
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] recovering data from a dettach mirrored vdev

2008-04-29 Thread Jeff Bonwick
Urgh.  This is going to be harder than I thought -- not impossible,
just hard.

When we detach a disk from a mirror, we write a new label to indicate
that the disk is no longer in use.  As a side effect, this zeroes out
all the old uberblocks.  That's the bad news -- you have no uberblocks.

The good news is that the uberblock only contains one field that's hard
to reconstruct: ub_rootbp, which points to the root of the block tree.
The root block *itself* is still there -- we just have to find it.

The root block has a known format: it's a compressed objset_phys_t,
almost certainly one sector in size (could be two, but very unlikely
because the root objset_phys_t is highly compressible).

It should be possible to write a program that scans the disk, reading
each sector and attempting to decompress it.  If it decompresses into
exactly 1K (size of an uncompressed objset_phys_t), then we can look
at all the fields to see if they look plausible.  Among all candidates
we find, the one whose embedded meta-dnode has the highest birth time
in its dn_blkptr is the one we want.

I need to get some sleep now, but I'll code this up in a couple of
days and we can take it from there.  If this is time-sensitive,
let me know and I'll see if I can find someone else to drive it.
[ I've got a bunch of commitments tomorrow, plus I'm supposed to
be on vacation... typical...  ;-)  ]

Jeff

On Tue, Apr 29, 2008 at 12:15:21AM -0700, Benjamin Brumaire wrote:
 Jeff thank you very much for taking time to look at this.
 
 My entire pool consisted of a single mirror of two slices on different disks 
 A and B. I attach a third slice on disk C and wait for resilver and then 
 detach it. Now disks A and B burned and I have only disk C at hand.
 
 bbr
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance of one single 'cp'

2008-04-14 Thread Jeff Bonwick
No, that is definitely not expected.

One thing that can hose you is having a single disk that performs
really badly.  I've seen disks as slow as 5 MB/sec due to vibration,
bad sectors, etc.  To see if you have such a disk, try my diskqual.sh
script (below).  On my desktop system, which has 8 drives, I get:

# ./diskqual.sh
c1t0d0 65 MB/sec
c1t1d0 63 MB/sec
c2t0d0 59 MB/sec
c2t1d0 63 MB/sec
c3t0d0 60 MB/sec
c3t1d0 57 MB/sec
c4t0d0 61 MB/sec
c4t1d0 61 MB/sec

The diskqual test is non-destructive (it only does reads), but to
get valid numbers you should run it on an otherwise idle system.

--

#!/bin/ksh

disks=`format </dev/null | grep c.t.d | nawk '{print $2}'`

getspeed1()
{
ptime dd if=/dev/rdsk/${1}s0 of=/dev/null bs=64k count=1024 2>&1 |
nawk '$1 == "real" { printf("%.0f\n", 67.108864 / $2) }'
}

getspeed()
{
for iter in 1 2 3
do
getspeed1 $1
done | sort -n | tail -2 | head -1
}

for disk in $disks
do
echo $disk `getspeed $disk` MB/sec
done

--

Jeff

On Tue, Apr 08, 2008 at 06:44:13AM -0700, Henrik Hjort wrote:
 Hi!
 
 I just want to check with the community to see if this is normal.
 
 I have used a X4500 with 500Gb disks and I'm not impressed by the copy 
 performance.
 I can run several jobs in parallel and get close to 400mb/s but I need better 
 performance
 from a single copy.  I have tried to be EVIL as well but without success.
 
 Tests done with:
 Solaris 10 U4
 Solaris 10 U5 (B10)
 Nevada B86
 
 *Setup*
 
 # zpool status
  pool: datapool
 state: ONLINE
 scrub: none requested
 config:
 
NAMESTATE READ WRITE CKSUM
datapoolONLINE   0 0 0
  mirrorONLINE   0 0 0
c0t0d0  ONLINE   0 0 0
c1t0d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c4t0d0  ONLINE   0 0 0
c6t0d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c7t0d0  ONLINE   0 0 0
c0t1d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t1d0  ONLINE   0 0 0
c4t1d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c5t1d0  ONLINE   0 0 0
c6t1d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c7t1d0  ONLINE   0 0 0
c0t2d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t2d0  ONLINE   0 0 0
c4t2d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c5t2d0  ONLINE   0 0 0
c6t2d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c7t2d0  ONLINE   0 0 0
c0t3d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t3d0  ONLINE   0 0 0
c4t3d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c5t3d0  ONLINE   0 0 0
c6t3d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c7t3d0  ONLINE   0 0 0
c0t4d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c1t4d0  ONLINE   0 0 0
c4t4d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c6t4d0  ONLINE   0 0 0
c7t4d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c0t5d0  ONLINE   0 0 0
c1t5d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c4t5d0  ONLINE   0 0 0
c5t5d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c6t5d0  ONLINE   0 0 0
c7t5d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c0t6d0  ONLINE   0 0 0
c1t6d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c4t6d0  ONLINE   0 0 0
c5t6d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c6t6d0  ONLINE   0 0 0
c7t6d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c0t7d0  ONLINE   0 0 0
c1t7d0  ONLINE   0 0 0
 
 *Result*  - Around 50-60mb/s read
 
 parsing profile for config: copyfiles
 Running 
 /tmp/temp165-231.*.*.COM-zfs-readtest-Apr_8_2008-09h_09m_07s/copyfiles/thisrun.f
 FileBench Version 1.2.2
  5109: 0.005: CopyFiles Version 2.3 personality successfully loaded
  5109: 0.005: Creating/pre-allocating files and 

Re: [zfs-discuss] zfs filesystem metadata checksum

2008-04-14 Thread Jeff Bonwick
Not at present, but it's a good RFE.  Unfortunately it won't be
quite as simple as just adding an ioctl to report the dnode checksum.
To see why, consider a file with one level of indirection: that is,
it consists of a dnode, a single indirect block, and several data blocks.
The indirect block contains the checksums of all the data blocks -- handy.
The dnode contains the checksum of the indirect block -- but that's not
so handy, because the indirect block contains more than just checksums;
it also contains pointers to blocks, which are specific to the physical
layout of the data on your machine.  If you did remote replication using
zfs send | ssh elsewhere zfs recv, the dnode checksum on 'elsewhere'
would not be the same.

Jeff

On Tue, Apr 08, 2008 at 01:45:16PM -0700, asa wrote:
 Hello all. I am looking to be able to verify my zfs backups in the  
 most minimal way, ie without having to md5 the whole volume.
 
 Is there a way to get a checksum for a snapshot and compare it to  
 another zfs volume, containing all the same blocks and verify they  
 contain the same information? Even when I destroy the snapshot on the  
 source?
 
 kind of like:
 
 zfs create tank/myfs
 dd if=/dev/urandom bs=128k count=1000 of=/tank/myfs/TESTFILE
 zfs snapshot tank/[EMAIL PROTECTED]
 zfs send tank/[EMAIL PROTECTED] | zfs recv tank/myfs_BACKUP
 
 zfs destroy tank/[EMAIL PROTECTED]
 
 zfs snapshot tank/[EMAIL PROTECTED]
 
 
 someCheckSumVodooFunc(tank/myfs)
 someCheckSumVodooFunc(tank/myfs_BACKUP)
 
 is there some zdb hackery which results in a metadata checksum usable  
 in this scenario?
 
 Thank you all!
 
 Asa
 zfs worshiper
 Berkeley, CA
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Per filesystem scrub

2008-04-05 Thread Jeff Bonwick
 Aye,  or better yet -- give the scrub/resilver/snap reset issue fix very
 high priority.   As it stands snapshots are impossible when you need to
 resilver and scrub (even on supposedly sun supported thumper configs).

No argument.  One of our top engineers is working on this as we speak.
I say we all buy him a drink when he integrates the fix.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Per filesystem scrub

2008-03-31 Thread Jeff Bonwick
Peter,

That's a great suggestion.  And as fortune would have it, we have the
code to do it already.  Scrubbing in ZFS is driven from the logical
layer, not the physical layer.  When you scrub a pool, you're really
just scrubbing the pool-wide metadata, then scrubbing each filesystem.

At 50,000 feet, it's as simple as adding a zfs(1M) scrub subcommand
and having it invoke the already-existing DMU traverse interface.

Closer to ground, there are a few details to work out -- we need an
option to specify whether to include snapshots, whether to descend
recursively (in the case of nested filesystems), and how to handle
branch points (which are created by clones).  Plus we need some way
to name the MOS (meta-object set, which is where we keep all pool
metadata) so you can ask to scrub only that.

Sounds like a nice tidy project for a summer intern!

Jeff

On Sat, Mar 29, 2008 at 05:14:20PM +, Peter Tribble wrote:
 A brief search didn't show anything relevant, so here
 goes:
 
 Would it be feasible to support a scrub per-filesystem
 rather than per-pool?
 
 The reason is that on a large system, a scrub of a pool can
 take excessively long (and, indeed, may never complete).
 Running a scrub on each filesystem allows it to be broken
 up into smaller chunks, which would be much easier to
 arrange. (For example, I could scrub one filesystem a
 night and not have it run into working hours.)
 
 Another reason might be that I have both busy and
 quiet filesystems. For the busy ones, they're regularly
 backed up, and the data regularly read anyway; for the
 quiet ones they're neither read nor backed up, so it
 would be nice to be able to validate those.
 
 -- 
 -Peter Tribble
 http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance lower than expected

2008-03-26 Thread Jeff Bonwick
 The disks in the SAN servers were indeed striped together with Linux LVM
 and exported as a single volume to ZFS.

That is really going to hurt.  In general, you're much better off
giving ZFS access to all the individual LUNs.  The intermediate
LVM layer kills the concurrency that's native to ZFS.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?

2008-03-02 Thread Jeff Bonwick
Nathan: yes.  Flipping each bit and recomputing the checksum is not only
possible, we actually did it in early versions of the code.  The problem
is that it's really expensive.  For a 128K block, that's a million bits,
so you have to re-run the checksum a million times, on 128K of data.
That's 128GB of data to churn through.

So Bob: you're right too.  It's generally much cheaper to retry the I/O,
try another disk, try a ditto block, etc.  That said, when all else fails,
a 128GB computation is a lot cheaper than a restore from tape.  At some
point it becomes a bit philosophical.  Suppose the block in question is
a single user data block.  How much of the machine should you be willing
to dedicate to getting that block back?  I mean, suppose you knew that
it was theoretically possible, but would consume 500 hours of CPU time
during which everything else would be slower -- and the affected app's
read() system call would hang for 500 hours.  What is the right policy?
There's no one right answer.  If we were to introduce a feature like this,
we'd need some admin-settable limit on how much time to dedicate to it.

For some checksum functions like fletcher2 and fletcher4, it is possible
to do much better than brute force because you can compute an incremental
update -- that is, you can compute the effect of changing the nth bit
without rerunning the entire checksum.  This is, however, not possible
with SHA-256 or any other secure hash.

We ended up taking that code out because single-bit errors didn't seem
to arise in practice, and in testing, the error correction had a rather
surprising unintended side effect: it masked bugs in the code!

The nastiest kind of bug in ZFS is something we call a future leak,
which is when some change from txg (transaction group) 37 ends up
going out as part of txg 36.  It normally wouldn't matter, except if
you lost power before txg 37 was committed to disk.  On reboot you'd
have inconsistent on-disk state (all of 36 plus random bits of 37).
We developed coding practices and stress tests to catch future leaks,
and as far as I know we've never actually shipped one.  But they are scary.

If you *do* have a future leak, it's not uncommon for it to be a very
small change -- perhaps incrementing a counter in some on-disk structure.
The thing is, if the counter is going from even to odd, that's exactly
a one-bit change.  The single-bit error correction logic would happily
detect these and fix them up -- not at all what you want when testing!
(Of course, we could turn it off during testing -- but then we wouldn't
be testing it.)

All that said, I'm still occasionally tempted to bring it back.
It may become more relevant with flash memory as a storage medium.

Jeff

On Sun, Mar 02, 2008 at 05:28:48PM -0600, Bob Friesenhahn wrote:
 On Mon, 3 Mar 2008, Nathan Kroenert wrote:
  Speaking of expensive, but interesting things we could do -
 
  From the little I know of ZFS's checksum, it's NOT like the ECC
  checksum we use in memory in that it's not something we can use to
  determine which bit flipped in the event that there was a single bit
  flip in the data. (I could be completely wrong here... but...)
 
 It seems that the emphasis on single-bit errors may be misplaced.  Is 
 there evidence which suggests that single-bit errors are much more 
 common than multiple bit errors?
 
  What is the chance we could put a little more resilience into ZFS such
  that if we do get a checksum error, we systematically flip each bit in
  sequence and check the checksum to see if we could in fact proceed
  (including writing the data back correctly.).
 
 It is easier to retry the disk read another 100 times or store the 
 data in multiple places.
 
  Or build into the checksum something analogous to ECC so we can choose
  to use NON-ZFS protected disks and paths, but still have single bit flip
  protection...
 
 Disk drives commonly use an algorithm like Reed Solomon 
 (http://en.wikipedia.org/wiki/Reed-Solomon_error_correction) which 
 provides forward-error correction.  This is done in hardware.  Doing 
 the same in software is likely to be very slow.
 
  What do others on the list think? Do we have enough folks using ZFS on
  HDS / EMC / other hardware RAID(X) environments that might find this useful?
 
 It seems that since ZFS is intended to support extremely large storage 
 pools, available energy should be spent ensuring that the storage pool 
 remains healthy or can be repaired.  Loss of individual file blocks is 
 annoying, but loss of entire storage pools is devastating.
 
 Since raw disk is cheap (and backups are expensive), it makes sense to 
 write more redundant data rather than to minimize loss through exotic 
 algorithms.  Even if RAID is not used, redundant copies may be used on 
 the same disk to help protect against block read errors.
 
 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,

Re: [zfs-discuss] Cause for data corruption?

2008-02-29 Thread Jeff Bonwick
 I thought RAIDZ would correct data errors automatically with the parity data.

Right.  However, if the data is corrupted while in memory (e.g. on a PC
with non-parity memory), there's nothing ZFS can do to detect that.
I mean, not even theoretically.  The best we could do would be to
narrow the windows of vulnerability by recomputing the checksum
every time we accessed an in-memory object, which would be terribly
expensive.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] moving zfs filesystems between disks

2008-02-27 Thread Jeff Bonwick
Yes.  Just say this:

# zpool replace mypool disk1 disk2

This will do all the intermediate steps you'd expect: attach disk2
as a mirror of disk1, resilver, detach disk2, and grow the pool
to reflect the larger size of disk1.

Jeff

On Wed, Feb 27, 2008 at 04:48:59PM -0800, Bill Shannon wrote:
 I've just started using zfs.  I copied data from a ufs filesystem on
 disk 1 to a zfs pool/filesystem on disk 2.  Can I add disk 1 as a mirror
 for disk 2, and then remove disk 2 from the mirror, and end up with all
 the data back on disk 1 in zfs (after some amount of time, of course)?
 If disk 1 is larger than disk 2, will the larger amount of space be
 available after I remove the disk 2 mirror?
 
 (Disk 2 is a full disk, but disk 1 is actually just a partition of a
 disk.  I assume that doesn't make any difference.)
 
 Thanks.
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] moving zfs filesystems between disks

2008-02-27 Thread Jeff Bonwick
Oops -- I transposed 1 and 2 in the last sentence.  Corrected version,
and hopefully a bit easier to read:

# zpool replace mypool olddisk newdisk

This will do all the intermediate steps you'd expect: attach newdisk
as a mirror of olddisk, resilver, detach olddisk, and grow the pool
to reflect the larger size of newdisk.

Jeff

On Wed, Feb 27, 2008 at 05:04:02PM -0800, Jeff Bonwick wrote:
 Yes.  Just say this:
 
 # zpool replace mypool disk1 disk2
 
 This will do all the intermediate steps you'd expect: attach disk2
 as a mirror of disk1, resilver, detach disk2, and grow the pool
 to reflect the larger size of disk1.
 
 Jeff
 
 On Wed, Feb 27, 2008 at 04:48:59PM -0800, Bill Shannon wrote:
  I've just started using zfs.  I copied data from a ufs filesystem on
  disk 1 to a zfs pool/filesystem on disk 2.  Can I add disk 1 as a mirror
  for disk 2, and then remove disk 2 from the mirror, and end up with all
  the data back on disk 1 in zfs (after some amount of time, of course)?
  If disk 1 is larger than disk 2, will the larger amount of space be
  available after I remove the disk 2 mirror?
  
  (Disk 2 is a full disk, but disk 1 is actually just a partition of a
  disk.  I assume that doesn't make any difference.)
  
  Thanks.
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raidz2 resilience on 3 disks

2008-02-21 Thread Jeff Bonwick
 1) If i create a raidz2 pool on some disks, start to use it, then the disks'
 controllers change. What will happen to my zpool? Will it be lost or is
 there some disk tagging which allows zfs to recognise the disks?

It'll be fine.  ZFS opens by path, but then checks both the devid and
the on-disk vdev label, which is dispositive when the others disagree.

 2) if i create a raidz2 on 3 HDs, do i have any resilience? If any one of
 those drives fails, do i loose everything? I've got one such pool and i'm
 afraid it's a ticking time bomb.

You're fine.  RAID-Z2 is N+2, and you have N=1.  A three-way mirror would
give you better performance (because there's no parity to generate),
but from a resilience standpoint they're equivalent.
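
For the record, the two layouts would be created like this (device names are
made up):

# zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0
# zpool create tank mirror c0t0d0 c0t1d0 c0t2d0

Both survive any two of the three disks failing; the mirror just skips the
parity math and gives you three full copies to read from.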

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lost intermediate snapshot; incremental backup still possible?

2008-02-12 Thread Jeff Bonwick
I think so.  On your backup pool, roll back to the last snapshot that
was successfully received.  Then you should be able to send an incremental
between that one and the present.
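
Something like this, assuming the last snapshot that made it to the USB pool
is @monday (all names are placeholders):

# zfs rollback -r backup/myfs@monday
# zfs snapshot tank/myfs@today
# zfs send -i tank/myfs@monday tank/myfs@today | zfs recv backup/myfs

The -r on rollback destroys any newer or partially-received snapshots on the
backup side so the incremental has a clean base to apply against.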

Jeff

On Thu, Feb 07, 2008 at 08:38:38AM -0800, Ian wrote:
 I keep my system synchronized to a USB disk from time to time.  The script 
 works by sending incremental snapshots to a pool on the USB disk, then 
 deleting those snapshots from the source machine.
 
 A botched script ended up deleting a snapshot that was not successfully 
 received on the USB disk.  Now, I've lost the ability to send incrementally 
 since the intermediate snapshot is lost.  From what I gather, if I try to 
 send a full snapshot, it will require deleting and replacing the dataset on 
 the USB disk.  Is there any way around this?
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue fixing ZFS corruption

2008-01-23 Thread Jeff Bonwick
The Silicon Image 3114 controller is known to corrupt data.
Google for silicon image 3114 corruption to get a flavor.
I'd suggest getting your data onto different h/w, quickly.

Jeff

On Wed, Jan 23, 2008 at 12:34:56PM -0800, Bertrand Sirodot wrote:
 Hi,
 
 I have been experiencing corruption on one of my ZFS pool over the last 
 couple of days. I have tried running zpool scrub on the pool, but everytime 
 it comes back with new files being corrupted. I would have thought that zpool 
 scrub would have identified the corrupted files once and for all and would be 
 fine afterwards. The feeling I have right now is that zpool scrub is actually 
 spreading the corruption and won't stop until I have no more files on the 
 file systems. 
 
 I am running 5.11 snv_60 on an Asus M2A VM motherboard. I am using both the 
 SATA controller on the motherboard and a Si3114 based controller. I have had 
 the Si3114 controller for a couple of years now with no issue, that I know of.
 
 Any idea? I was trying to salvage the situation, but it looks like I am going 
 to have to destroy the pool and recreate it.
 
 Thanks a lot in advance,
 Bertrand.
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Issue fixing ZFS corruption

2008-01-23 Thread Jeff Bonwick
Actually s10_72, but it's not really a fix, it's a workaround
for a bug in the hardware.  I don't know how effective it is.

Jeff

On Wed, Jan 23, 2008 at 04:54:54PM -0800, Erast Benson wrote:
 I believe issue been fixed in snv_72+, no?
 
 On Wed, 2008-01-23 at 16:41 -0800, Jeff Bonwick wrote:
  The Silicon Image 3114 controller is known to corrupt data.
  Google for silicon image 3114 corruption to get a flavor.
  I'd suggest getting your data onto different h/w, quickly.
  
  Jeff
  
  On Wed, Jan 23, 2008 at 12:34:56PM -0800, Bertrand Sirodot wrote:
   Hi,
   
   I have been experiencing corruption on one of my ZFS pool over the last 
   couple of days. I have tried running zpool scrub on the pool, but 
   everytime it comes back with new files being corrupted. I would have 
   thought that zpool scrub would have identified the corrupted files once 
   and for all and would be fine afterwards. The feeling I have right now is 
   that zpool scrub is actually spreading the corruption and won't stop 
   until I have no more files on the file systems. 
   
   I am running 5.11 snv_60 on an Asus M2A VM motherboard. I am using both 
   the SATA controller on the motherboard and a Si3114 based controller. I 
   have had the Si3114 controller for a couple of years now with no issue, 
   that I know of.
   
   Any idea? I was trying to salvage the situation, but it looks like I am 
   going to have to destroy the pool and recreate it.
   
   Thanks a lot in advance,
   Bertrand.


   This message posted from opensolaris.org
   ___
   zfs-discuss mailing list
   zfs-discuss@opensolaris.org
   http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 recommendations for netbackup dsu?

2007-12-20 Thread Jeff Bonwick
Yep, compression is generally a nice win for backups.  The amount of
compression will depend on the nature of the data.  If it's all mpegs,
you won't see any advantage because they're already compressed.  But
for just about everything else, 2-3x is typical.

As for hot spares, they are indeed global.
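
A sketch, with hypothetical dataset and device names:

zfs set compression=on tank/dsu        # enable compression on the DSU filesystem
zfs get compressratio tank/dsu         # see what you're actually getting back
zpool add tank spare c5t0d0 c5t1d0     # spares added this way serve the whole pool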

Jeff

On Tue, Dec 11, 2007 at 03:16:44PM -0800, Dave Lowenstein wrote:
 Okay, my order for an x4500 went through so sometime soon I'll be using 
 it as a big honkin area for DSUs and DSSUs for netbackup.
 
 Does anybody have any experience with using zfs compression for this 
 purpose? The thought of doubling 48tb to 96 tb is enticing. Are there 
 any other zfs tweaks that might aid in performance for what will pretty 
 much be a lot of long and large reads and writes?
 
 I'm planning on one big chunk of space for a permanently on disk DSU, 
 and another for the DSSU staging areas.
 
 Also, I haven't looked into this but is a spare considered part of a 
 zpool, or is there such a thing as a global spare?
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Roadmap - thoughts on expanding raidz / restriping / defrag

2007-12-17 Thread Jeff Bonwick
In short, yes.  The enabling technology for all of this is something
we call bp rewrite -- that is, the ability to rewrite an existing
block pointer (bp) to a new location.  Since ZFS is COW, this would
be trivial in the absence of snapshots -- just touch all the data.
But because a block may appear in many snapshots, there's more to it.
It's not impossible, just a bit tricky... and we're working on it.

Once we have bp rewrite, many cool features will become available as
trivial applications of it: on-line defrag, restripe, recompress, etc.

Jeff

On Mon, Dec 17, 2007 at 02:29:14AM -0800, Ross wrote:
 Hey folks,
 
 Does anybody know if any of these are on the roadmap for ZFS, or have any 
 idea how long it's likely to be before we see them (we're in no rush - late 
 2008 would be fine with us, but it would be nice to know they're being worked 
 on)?
 
 I've seen many people ask for the ability to expand a raid-z pool by adding 
 devices.  I'm wondering if it would be useful to work on a defrag / 
 restriping tool to work hand in hand with this.
 
 I'm assuming that when the functionality is available, adding a disk to a 
 raid-z set will mean the existing data stays put, and new data is written 
 across a wider stripe.  That's great for performance for new data, but not so 
 good for the existing files.  Another problem is that you can't guarantee how 
 much space will be added.  That will have to be calculated based on how much 
 data you already have.
 
 ie:  If you have a simple raid-z of five 500GB drives, you would expect 
 adding another drive to add 500GB of space.  However, if your pool is half 
 full, you can only make use of 250GB of space, the other 250GB is going to be 
 wasted.
 
 What I would propose to solve this is to implement a defrag / restripe 
 utility as part of the raid-z upgrade process, making it a three step process:
 
  - New drive added to raid-z pool
  - Defrag tool begins restriping and defragmenting old data 
  - Once restripe complete, pool reports the additional free space
 
 There are some limitations to this.  You would maybe want to advise that 
 expanding a raid-z pool should only be done with a reasonable amount of free 
 disk space, and that it may take some time.  It may also be beneficial to add 
 the ability to add multiple disks in one go.
 
 However, if it works it would seem to add several benefits:
  - Raid-z pools can be expanded
  - ZFS gains a defrag tool
  - ZFS gains a restriping tool
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best option for my home file server?

2007-09-26 Thread Jeff Bonwick
I would keep it simple.  Let's call your 250GB disks A, B, C, D,
and your 500GB disks X and Y.  I'd either make them all mirrors:

zpool create mypool mirror A B mirror C D mirror X Y

or raidz the little ones and mirror the big ones:

zpool create mypool raidz A B C D mirror X Y

or, as you mention, get another 500GB disk, Z, and raidz like this:

zpool create mypool raidz A B C D raidz X Y Z

Jeff

On Wed, Sep 26, 2007 at 01:06:38PM -0700, Christopher wrote:
 I'm about to build a fileserver and I think I'm gonna use OpenSolaris and ZFS.
 
 I've got a 40GB PATA disk which will be the OS disk, and then I've got 
 4x250GB SATA + 2x500GB SATA disks. From what you are writing I would think my 
 best option would be to slice the 500GB disks in two 250GB and then make two 
 RAIDz with two 250 disks and one partition from each 500 disk, giving me two 
 RAIDz of 4 slices of 250, equaling to 2 x 750GB RAIDz.
 
 How would the performance be with this? I mean, it would probably drop since 
 I would have two raidz slices on one disk.
 
 From what I gather, I would still be able to lose one of the 500 disks (or 
 250) and still be able to recover, right?
 
 Perhaps I should just get another 500GB disk and run a RAIDz on the 500s and 
 one RAIDz on the 250s?
 
 I'm also a bit of a noob when it comes to ZFS (but it looks like it's not 
 that hard to admin) - Would I be able to join the two RAIDz together for one 
 BIG volume altogether? And it will survive one disk failure?
 
 /Christopher
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS panic when trying to import pool

2007-09-18 Thread Jeff Bonwick
Basically, it is complaining that there aren't enough disks to read
the pool metadata.  This would suggest that in your 3-disk RAID-Z
config, either two disks are missing, or one disk is missing *and*
another disk is damaged -- due to prior failed writes, perhaps.

(I know there's at least one disk missing because the failure mode
is errno 6, which is ENXIO.)

Can you tell from /var/adm/messages or fmdump whether there were write
errors to multiple disks, or to just one?
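
A sketch of where to look:

fmdump -eV                        # FMA error reports; note which devices they name
grep -i error /var/adm/messages   # driver-level complaints, per disk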

Jeff

On Tue, Sep 18, 2007 at 05:26:16PM -0700, Geoffroy Doucet wrote:
 I have a raid-z zfs filesystem with 3 disks. One disk was starting to have read 
 and write errors.
 
 The disk was so bad that I started to have trans_err. The server locked up and 
 was reset. Now when trying to import the pool the system 
 panics.
 
 I installed the last Recommend on my Solaris U3 and also install the last 
 Kernel patch (120011-14).
 
 But still, when trying to do zpool import pool, it panics.
 
 I also dd the disk and tested on another server with OpenSolaris B72 and 
 still the same thing. Here is the panic backtrace:
 
 Stack Backtrace
 -
 vpanic()
 assfail3+0xb9(f7dde5f0, 6, f7dde840, 0, f7dde820, 153)
 space_map_load+0x2ef(ff008f1290b8, c00fc5b0, 1, ff008f128d88,
 ff008dd58ab0)
 metaslab_activate+0x66(ff008f128d80, 8000)
 metaslab_group_alloc+0x24e(ff008f46bcc0, 400, 3fd0f1, 32dc18000,
 ff008fbeaa80, 0)
 metaslab_alloc_dva+0x192(ff008f2d1a80, ff008f235730, 200,
 ff008fbeaa80, 0, 0)
 metaslab_alloc+0x82(ff008f2d1a80, ff008f235730, 200, 
 ff008fbeaa80, 2
 , 3fd0f1)
 zio_dva_allocate+0x68(ff008f722790)
 zio_next_stage+0xb3(ff008f722790)
 zio_checksum_generate+0x6e(ff008f722790)
 zio_next_stage+0xb3(ff008f722790)
 zio_write_compress+0x239(ff008f722790)
 zio_next_stage+0xb3(ff008f722790)
 zio_wait_for_children+0x5d(ff008f722790, 1, ff008f7229e0)
 zio_wait_children_ready+0x20(ff008f722790)
 zio_next_stage_async+0xbb(ff008f722790)
 zio_nowait+0x11(ff008f722790)
 dmu_objset_sync+0x196(ff008e4e5000, ff008f722a10, ff008f260a80)
 dsl_dataset_sync+0x5d(ff008df47e00, ff008f722a10, ff008f260a80)
 dsl_pool_sync+0xb5(ff00882fb800, 3fd0f1)
 spa_sync+0x1c5(ff008f2d1a80, 3fd0f1)
 txg_sync_thread+0x19a(ff00882fb800)
 thread_start+8()
 
 
 
 And here is the panic message buf:
 panic[cpu0]/thread=ff0001ba2c80:
 assertion failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 
 (0x6 == 0x0), file: ../../common/fs/zfs/space_map.c, line: 339
 
 
 ff0001ba24f0 genunix:assfail3+b9 ()
 ff0001ba2590 zfs:space_map_load+2ef ()
 ff0001ba25d0 zfs:metaslab_activate+66 ()
 ff0001ba2690 zfs:metaslab_group_alloc+24e ()
 ff0001ba2760 zfs:metaslab_alloc_dva+192 ()
 ff0001ba2800 zfs:metaslab_alloc+82 ()
 ff0001ba2850 zfs:zio_dva_allocate+68 ()
 ff0001ba2870 zfs:zio_next_stage+b3 ()
 ff0001ba28a0 zfs:zio_checksum_generate+6e ()
 ff0001ba28c0 zfs:zio_next_stage+b3 ()
 ff0001ba2930 zfs:zio_write_compress+239 ()
 ff0001ba2950 zfs:zio_next_stage+b3 ()
 ff0001ba29a0 zfs:zio_wait_for_children+5d ()
 ff0001ba29c0 zfs:zio_wait_children_ready+20 ()
 ff0001ba29e0 zfs:zio_next_stage_async+bb ()
 ff0001ba2a00 zfs:zio_nowait+11 ()
 ff0001ba2a80 zfs:dmu_objset_sync+196 ()
 ff0001ba2ad0 zfs:dsl_dataset_sync+5d ()
 ff0001ba2b40 zfs:dsl_pool_sync+b5 ()
 ff0001ba2bd0 zfs:spa_sync+1c5 ()
 ff0001ba2c60 zfs:txg_sync_thread+19a ()
 ff0001ba2c70 unix:thread_start+8 ()
 
 syncing file systems...
 
 
 Is there a way to restore the data? Is there a way to fsck the zpool, and 
 correct the error manually?
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.

2007-09-11 Thread Jeff Bonwick
 As you can see, two independent ZFS blocks share one parity block.
 COW won't help you here, you would need to be sure that each ZFS
 transaction goes to a different (and free) RAID5 row.
 
 This is I belive the main reason why poor RAID5 wasn't used in the first
 place.

Exactly right.  RAID-Z has different performance trade-offs than RAID-5,
but the deciding factor was correctness.

I'm really glad you're doing these experiments!  It's good to know what
the trade-offs are, performance-wise, between RAID-Z and classic RAID-5.
At a minimum, it tells us what's on the table, and what we're paying for
transactional semantics.  To be honest, I'm pleased that it's only 2x.
It wouldn't have surprised me if it were Nx for an N+1 configuration.
A factor of 2 is something we can work with.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mysterious corruption with raidz2 vdev

2007-07-30 Thread Jeff Bonwick
I suspect this is a bug in raidz error reporting.  With a mirror,
each copy either checksums correctly or it doesn't, so we know
which drives gave us bad data.  With RAID-Z, we have to infer
which drives have damage.  If the number of drives returning bad
data is less than or equal to the number of parity drives, we can
both detect and correct the error.  But if, say, three drives in
a RAID-Z2 stripe return corrupt data, we have no way to know which
drives are at fault -- there's just not enough information, and I
mean that in the mathematical sense (fewer equations than unknowns).

That said, we should enhance 'zpool status' to indicate the number
of detected-but-undiagnosable errors on each RAID-Z vdev.

Jeff

Kevin wrote:
 We'll try running all of the diagnostic tests to rule out any other issues.
 
 But my question is, wouldn't I need to see at least 3 checksum errors on the 
 individual devices in order for there to be a visible error in the top level 
 vdev? There doesn't appear to be enough raw checksum errors on the disks for 
 there to have been 3 errors in the same vdev block. Or am I not understanding 
 the checksum count correctly?
  
  
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS raid is very slow???

2007-07-06 Thread Jeff Bonwick
A couple of questions for you:

(1) What OS are you running (Solaris, BSD, MacOS X, etc)?

(2) What's your config?  In particular, are any of the partitions
on the same disk?

(3) Are you copying a few big files or lots of small ones?

(4) Have you measured UFS-to-UFS and ZFS-to-ZFS performance on the
same platform?  That'd be useful data...

Jeff

On Fri, Jul 06, 2007 at 03:49:43PM -0400, Will Murnane wrote:
 On 7/6/07, Orvar Korvar [EMAIL PROTECTED] wrote:
  I have set up a ZFS raidz with 4 Samsung 500GB hard drives.
 
  It is extremely slow when I mount an NTFS partition and copy everything to 
  ZFS. It's like 100kb/sec or less. Why is that?
 How are you mounting said NTFS partition?
 
  When I copy from the ZFS pool to UFS, I get like 40MB/sec - isn't it very low
  considering I have 4 new 500GB discs in raid? And when I copy from UFS to 
  ZPool
  I get like 20MB/sec. Strange? Or normal results? Should I expect better
  performance? As of now, I am disappointed of ZFS.
 How fast is copying a file from ZFS to /dev/null?  That would
 eliminate the UFS disk from the mix.
 
 Will
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: zfs reports small st_size for directories?

2007-06-09 Thread Jeff Bonwick
 What was the reason to make ZFS use directory sizes as the number of
 entries rather than the way other Unix filesystems use it?

In UFS, the st_size is the size of the directory inode as though it
were a file.  The only reason it's like that is that UFS is sloppy
and lets you cat directories -- a fine way to screw up your terminal
settings, but otherwise not terribly useful.  For reads (rather than
readdirs) of a directory to work, st_size has to be this way.

With ZFS, we decided to enforce file vs. directory semantics -- no
read(2) of directories, no directory hard links (even as root), etc.

What, then, should we return for st_size?  We figured the number of
entries would be the most useful piece of information for a sysadmin.
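
A quick way to see it, sketched with a hypothetical ZFS mountpoint:

mkdir /tank/demo
touch /tank/demo/a /tank/demo/b /tank/demo/c
ls -ld /tank/demo    # the size field tracks the entry count, not a byte count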

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Multiple filesystem costs? Directory sizes?

2007-05-01 Thread Jeff Bonwick

Mario,

For the reasons you mentioned, having a few different filesystems
(on the order of 5-10, I'd guess) can be handy.  Any time you want
different behavior for different types of data, multiple filesystems
are the way to go.
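
A sketch of how that might look for the use cases in your note, with
hypothetical dataset names:

zfs create tank/pictures
zfs set recordsize=8k tank/pictures    # lots of small web images
zfs create tank/video                  # big video files; keep the 128k default
zfs set atime=off tank/video
zfs snapshot tank/pictures@weekly      # snapshot/back up each filesystem on its own schedule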

For maximum directory size, it turns out that the practical limits
aren't in ZFS -- they're in your favorite applications, like ls(1)
and file browsers.  ZFS won't mind if you put millions of files
in a directory, but ls(1) will be painfully slow.  Similarly, if
you're using a mail program and you go to a big directory to grab
an attachment... you'll wait and wait while it reads the first few
bytes of every file in the directory to determine its type.

Hope that helps,

Jeff

Mario Goebbels wrote:

While setting up my new system, I'm wondering whether I should go with plain 
directories or use ZFS filesystems for specific stuff. About the cost of ZFS 
filesystems, I read on some Sun blog in the past about something like 64k 
kernel memory (or whatever) per active filesystem. What are however the 
additional costs?

The reason I'm considering multiple filesystems is for instance easy ZFS 
backups and snapshots, but also tuning the recordsizes. Like storing lots of 
generic pictures from the web, smaller recordsizes may be appropriate to trim 
 down the waste once the filesize surpasses the record size, as well as using 
 large recordsizes for video files on a separate filesystem. Turning compression 
 and access times on and off for performance reasons is another thing.

Also, in this same message, I'd like to ask what sensible maximum directory 
sizes are. As in amount of files.

Thanks.
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS stalling problem

2007-03-04 Thread Jeff Bonwick

Jesse,

This isn't a stall -- it's just the natural rhythm of pushing out
transaction groups.  ZFS collects work (transactions) until either
the transaction group is full (measured in terms of how much memory
the system has), or five seconds elapse -- whichever comes first.

Your data would seem to suggest that the read side isn't delivering
data as fast as ZFS can write it.  However, it's possible that
there's some sort of 'breathing' effect that's hurting performance.
One simple experiment you could try: patch txg_time to 1.  That
will cause ZFS to push transaction groups every second instead of
the default of every 5 seconds.  If this helps (or if it doesn't),
please let us know.
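
A sketch of the experiment on a live kernel (units are seconds):

echo txg_time/W1 | mdb -kw   # push a transaction group every second
echo txg_time/W5 | mdb -kw   # restore the 5-second default when you're done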

Thanks,

Jeff

Jesse DeFer wrote:

Hello,

I am having problems with ZFS stalling when writing, any help in 
troubleshooting would be appreciated.  Every 5 seconds or so the write 
bandwidth drops to zero, then picks up a few seconds later (see the zpool 
iostat at the bottom of this message).  I am running SXDE, snv_55b.

My test consists of copying a 1gb file (with cp) between two drives, one 80GB 
PATA, one 500GB SATA.  The first drive is the system drive (UFS), the second is 
for data.  I have configured the data drive with UFS and it does not exhibit 
the stalling problem and it runs in almost half the time.  I have tried many 
different ZFS settings as well: atime=off, compression=off, checksums=off, 
zil_disable=1 all to no effect.  CPU jumps to about 25% system time during the 
stalls, and hovers around 5% when data is being transferred.

# zpool iostat 1
   capacity operationsbandwidth
pool used  avail   read  write   read  write
--  -  -  -  -  -  -
tank 183M   464G  0 17  1.12K  1.93M
tank 183M   464G  0457  0  57.2M
tank 183M   464G  0445  0  55.7M
tank 183M   464G  0405  0  50.7M
tank 366M   464G  0226  0  4.97M
tank 366M   464G  0  0  0  0
tank 366M   464G  0  0  0  0
tank 366M   464G  0  0  0  0
tank 366M   464G  0200  0  25.0M
tank 366M   464G  0431  0  54.0M
tank 366M   464G  0445  0  55.7M
tank 366M   464G  0423  0  53.0M
tank 574M   463G  0270  0  18.1M
tank 574M   463G  0  0  0  0
tank 574M   463G  0  0  0  0
tank 574M   463G  0  0  0  0
tank 574M   463G  0164  0  20.5M
tank 574M   463G  0504  0  63.1M
tank 574M   463G  0405  0  50.7M
tank 753M   463G  0404  0  42.6M
tank 753M   463G  0  0  0  0
tank 753M   463G  0  0  0  0
tank 753M   463G  0  0  0  0
tank 753M   463G  0343  0  42.9M
tank 753M   463G  0476  0  59.5M
tank 753M   463G  0465  0  50.4M
tank 907M   463G  0 68  0   390K
tank 907M   463G  0  0  0  0
tank 907M   463G  0 11  0  1.40M
tank 907M   463G  0451  0  56.4M
tank 907M   463G  0492  0  61.5M
tank1.01G   463G  0139  0  7.94M
tank1.01G   463G  0  0  0  0

Thanks,
Jesse DeFer
 
 
This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] FAULTED ZFS volume even though it is mirrored

2007-03-01 Thread Jeff Bonwick
 However, I logged in this morning to discover that the ZFS volume could
 not be read. In addition, it appears to have marked all drives, mirrors
  the volume itself as 'corrupted'.

One possibility: I've seen this happen when a system doesn't shut down
cleanly after the last change to the pool configuration.  In this case,
what can happen is that the boot archive (an annoying implementation
detail of the new boot architecture) can be out of date relative to
your pool.  In particular, the stale boot archive may contain an old
version of /etc/zfs/zpool.cache, which confuses the initial pool open.

The workaround for this is simple enough: export the pool and then
import it.  Assuming this works, you can fix the stupid boot archive
by running 'bootadm update-archive'.  Please let us know if this helps.
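
A sketch of the workaround, assuming a pool named tank:

zpool export tank
zpool import tank
bootadm update-archive   # refresh the boot archive, including its copy of zpool.cache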

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does running redundancy with ZFS use as much disk space as doubling drives?

2007-02-26 Thread Jeff Bonwick
On Mon, Feb 26, 2007 at 01:53:17AM -0800, Tor wrote:
 [...] if using redundancy on ZDF

The ZFS Document Format?  ;-)

 uses less disk space than simply getting extra drives and doing identical copies,
 with periodic CRC checks of the source material to check the health.

If you create a 2-disk mirror, then it is indeed simply two copies.
But if you create, say, a 5-disk RAID-Z group, then you get 4 data
disks worth of space.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Jeff Bonwick

Do you agree that there is a major tradeoff in
building up a wad of transactions in memory?


I don't think so.  We trigger a transaction group commit when we
have lots of dirty data, or 5 seconds elapse, whichever comes first.
In other words, we don't let updates get stale.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-12 Thread Jeff Bonwick

That is interesting. Could this account for disproportionate kernel
CPU usage for applications that perform I/O one byte at a time, as
compared to other filesystems? (Nevermind that the application
shouldn't do that to begin with.)


No, this is entirely a matter of CPU efficiency in the current code.
There are two issues; we know what they are; and we're fixing them.

The first is that as we translate from znode to dnode, we throw away
information along the way -- we go from znode to object number (fast),
but then we have to do an object lookup to get from object number to
dnode (slow, by comparison -- or more to the point, slow relative to
the cost of writing a single byte).  But this is just stupid, since
we already have a dnode pointer sitting right there in the znode.
We just need to fix our internal interfaces to expose it.

The second problem is that we're not very fast at partial-block
updates.  Again, this is entirely a matter of code efficiency,
not anything fundamental.


I still would love to see something like fbarrier() defined by some
standard (de facto or otherwise) to make the distinction between
ordered writes and guaranteed persistence more easily exploited in the
general case for applications, and encourage filesystems/storage
systems to optimize for that case (i.e., not have fbarrier() simply
fsync()).


Totally agree.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS vs NFS vs array caches, revisited

2007-02-11 Thread Jeff Bonwick
 How did ZFS striped on 7 slices of an FC-SATA LUN via NFS work 146 times 
 faster than ZFS on 1 slice of the same LUN via NFS???

Without knowing more I can only guess, but most likely it's a simple
matter of working set.  Suppose the benchmark in question has a 4G
working set, and suppose that each LUN is fronted by a 1G cache.
With a single LUN, only 1/4 of your working set fits in cache,
so you're doing a fair amount of actual disk I/O.  With 7 LUNs,
you've got 7G of cache, so the entire benchmark fits in cache --
no disk I/O.  The factor of 100x is what tells me this is almost
certainly a working-set effect.

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs corruption -- odd inum?

2007-02-11 Thread Jeff Bonwick
The object number is in hex.  21e282 hex is 2220674 decimal --
give that a whirl.
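
A sketch of the whole lookup, assuming the dataset 'cc' is mounted at /cc:

printf '%d\n' 0x21e282          # prints 2220674
find /cc -mount -inum 2220674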

This is all better now thanks to some recent work by Eric Kustarz:

6410433 'zpool status -v' would be more useful with filenames

This was integrated into Nevada build 57.

Jeff

On Sat, Feb 10, 2007 at 05:18:05PM -0800, Joe Little wrote:
 So, I'm attempting to find the inode from the result of a zpool status -v:
 
 errors: The following persistent errors have been detected:
 
  DATASET  OBJECT  RANGE
  cc   21e382  lvl=0 blkid=0
 
 
 Well, 21e282 appears to not be a valid number for find . -inum blah
 
 Any suggestions?
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs rewrite?

2007-01-26 Thread Jeff Bonwick
On Fri, Jan 26, 2007 at 10:57:19PM -0800, Frank Cusack wrote:
 On January 27, 2007 12:27:17 AM -0200 Toby Thain [EMAIL PROTECTED] wrote:
 On 26-Jan-07, at 11:34 PM, Pawel Jakub Dawidek wrote:
 3. I created file system with huge amount of data, where most of the
 data is read-only. I change my server from intel to sparc64 machine.
 Adaptive endianness only changes byte order to native on write, and
 because the
 file system is mostly read-only, it'll need to byteswap all the time.
 And here comes 'zfs rewrite'!
 
 Why would this help? (Obviously file data is never 'swapped').
 
 Metadata (incl checksums?) still has to be byte-swapped.  Or would
 atime updates also force a metadata update?  Or am I totally mistaken.

You're all correct.  File data is never byte-swapped.  Most metadata
needs to be byte-swapped, but it's generally only 1-2% of your space.
So the overhead shouldn't be significant, even if you never rewrite.

An atime update will indeed cause a znode rewrite (unless you run
with zfs set atime=off), so znodes will get rewritten by reads.

The only other non-trivial metadata is the indirect blocks.
All files up to 128k are stored in a single block: ZFS has
variable blocksize from 512 bytes to 128k, so a 35k file consumes
exactly 35k (not, say, 40k as it would with a fixed 8k blocksize).
Single-block files have no indirect blocks, and hence no metadata
other than the znode.  So all that remains is the indirect blocks
for files larger than 128k -- which is to say, not very much.
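
An easy way to see the variable blocksize at work, sketched with a
hypothetical filesystem (compression off):

mkfile 35k /tank/fs/smallfile
du -k /tank/fs/smallfile   # reports close to 35, not a round-up to some larger fixed block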

Jeff
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] File Space Allocation

2006-11-04 Thread Jeff Bonwick
 Where can I find information on the file allocation methodology used by ZFS?

You've inspired me to blog again:

http://blogs.sun.com/bonwick/entry/zfs_block_allocation

I'll describe the way we manage free space in the next post.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Snapshots impact on performance

2006-10-29 Thread Jeff Bonwick
 Nice, this is definitely pointing the finger more definitively.  Next 
 time could you try:
 
 dtrace -n '[EMAIL PROTECTED](20)] = count()}' -c 'sleep 5'
 
 (just send the last 10 or so stack traces)
 
 In the mean time I'll talk with our SPA experts and see if I can figure 
 out how to fix this...

By any chance is the pool fairly close to full?  The fuller it gets,
the harder it becomes to find long stretches of free space.
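
A quick check, with a hypothetical pool name:

zpool list tank   # the CAP column shows how full the pool is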

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Corrupted LUN in RAIDZ group -- How to repair?

2006-09-10 Thread Jeff Bonwick
 It looks like now the scrub has completed.  Should I now clear these warnings?

Yep.  You survived the Unfortunate Event unscathed.  You're golden.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: system unresponsive after issuing a zpool attach

2006-08-17 Thread Jeff Bonwick
 And it started replacement/resilvering... after a few minutes the system became 
unavailable. Reboot only gives me a few minutes, then resilvering makes the system 
unresponsive.
 
 Is there any workaroud or patch for this problem???

Argh, sorry -- the problem is that we don't do aggressive enough
scrub/resilver throttling.  The effect is most pronounced on 32-bit
or low-memory systems.  We're working on it.

One thing you might try is reducing txg_time to 1 second (the default
is 5 seconds) by saying this: echo txg_time/W1 | mdb -kw.
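
A sketch of checking the value and backing the change out later:

echo txg_time/D | mdb -k     # print the current value in decimal
echo txg_time/W5 | mdb -kw   # restore the 5-second default afterwards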

Let me describe what's happening, and why this may help.

When we kick off a scrub (same code path as resilver, so I'll use
the term generically), we traverse the entire block tree looking
for blocks that need scrubbing.  The tree traversal itself is
single-threaded, but the work it generates is not -- each time
we find a block that needs scrubbing, we schedule an async I/O
to do it.  As you've discovered, we can generate work faster than
the I/O subsystem can process it.  To avoid overloading the disks,
we throttle I/O downstream, but we don't (yet) have an upstream
throttle.  If we discover blocks really fast, we can end up
scheduling lots of I/O -- and sitting on lots of memory -- before
the downstream throttle kicks in.

The reason this relates to txg_time is that every time we sync a
transaction group, we suspend the scrub thread and wait for all
pending scrub I/Os to complete.  This ensures that we won't
asynchronously scrub a block that was freed and reallocated
in a future txg; when coupled with the COW nature of ZFS,
this allows us to run scrubs entirely independent of all
filesystem-level structure (e.g. directories) and locking rules.
This little trick makes the scrubbing algorithms *much* simpler.

The key point is that each spa_sync() throttles the scrub to zero.
By lowering txg_time from 5 to 1, you're cutting down the maximum
number of pending scrub I/Os by roughly 5x.  The unresponsiveness
you're seeing is a threshold effect; I'm hoping that by running
spa_sync() more often, we can get you below that threshold.

Please let me know if this works for you.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance using slices vs. entire disk?

2006-08-03 Thread Jeff Bonwick
 ZFS will try to enable write cache if whole disks is given.
 
 Additionally keep in mind that outer region of a disk is much faster.

And it's portable.  If you use whole disks, you can export the
pool from one machine and import it on another.  There's no way
to export just one slice and leave the others behind...

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance using slices vs. entire disk?

2006-08-03 Thread Jeff Bonwick
 is zfs any less efficient with just using a portion of a 
 disk versus the entire disk?

As others mentioned, if we're given a whole disk (i.e. no slice
is specified) then we can safely enable the write cache.

One other effect -- probably not huge -- is that the block placement
algorithm is most optimal for an outer-to-inner track diameter ratio
of about 2:1, which reflects typical platters.  To quote the source:

http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/metaslab.c#metaslab_weight

/*
 * Modern disks have uniform bit density and constant angular velocity.
 * Therefore, the outer recording zones are faster (higher bandwidth)
 * than the inner zones by the ratio of outer to inner track diameter,
 * which is typically around 2:1.  We account for this by assigning
 * higher weight to lower metaslabs (multiplier ranging from 2x to 1x).
 * In effect, this means that we'll select the metaslab with the most
 * free bandwidth rather than simply the one with the most free space.
 */

But like I said, the effect isn't huge -- the high-order bit is that we
have a preference for low LBAs.  It's a second-order optimization
to bias the allocation based on the maximum free bandwidth, which is
currently based on an assumption about physical disk construction.
In the future we'll do the smart thing and compute each metaslab's
allocation bias based on its actual observed bandwidth.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS performance using slices vs. entire disk?

2006-08-03 Thread Jeff Bonwick
 With all of the talk about performance problems due to
 ZFS doing a sync to force the drives to commit to data
 being on disk, how much of a benefit is this - especially
 for NFS?

It depends.  For some drives it's literally 10x.

 Also, if I was lucky enough to have a working prestoserv
 card around, would ZFS be able to take advantage of
 that at all?

I'm working on the general lack-of-NVRAM-in-servers problem.
As for using presto, I don't think it'd be too hard.  We've
already structured the code so that allocating intent log
blocks from a different set of vdevs would be straightforward.
It's probably a week's work to define the new metaslab class,
new vdev type, and modify the ZIL to use it.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] sharing a storage array

2006-07-28 Thread Jeff Bonwick
  bonus questions: any idea when hot spares will make it to S10?
 
 good question :)

It'll be in U3, and probably available as patches for U2 as well.
The reason for U2 patches is Thumper (x4500), because we want ZFS
on Thumper to have hot spares and double-parity RAID-Z from day one.

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] persistent errors - which file?

2006-07-27 Thread Jeff Bonwick
 I've a non-mirrored zfs file systems which shows the status below. I saw
 the thread in the archives about working this out but it looks like ZFS
 messages have changed. How do I find out what file(s) this is?
 [...]
 errors: The following persistent errors have been detected:
 
   DATASET  OBJECT  RANGE
   LOCAL    28905   3262251008-3262382080

I realize this is a bit lame, but currently the answer is:

find /LOCAL -mount -inum 28905

And yes, we do indeed plan to automate this.  ;-)

Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

