Re: [zfs-discuss] Dedup - Does on imply sha256?
Correct.

Jeff

On Aug 24, 2010, at 9:45 PM, Peter Taps wrote:

> Folks,
>
> One of the articles on the net says that the following two commands
> are exactly the same:
>
>   # zfs set dedup=on tank
>   # zfs set dedup=sha256 tank
>
> Essentially, "on" is just a pseudonym for "sha256", and "verify" is
> just a pseudonym for "sha256,verify". Can someone please confirm if
> this is true?
>
> Thank you in advance for your help.
>
> Regards,
> Peter
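A quick hedged way to confirm this for yourself -- the pool name "tank" is hypothetical, and the zdb usage mirrors the fletcher4 heads-up later in this digest:

	# zfs set dedup=on tank
	# zdb -D tank | grep DDT

Once some deduped data has been written, the dedup table entries appear under a "DDT-sha256" heading, confirming that "on" selects sha256.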
Re: [zfs-discuss] gang blocks at will?
You can set metaslab_gang_bang to (say) 8k to force lots of gang block allocations.

Jeff

On May 25, 2010, at 11:42 PM, Andriy Gapon wrote:

> I am working on improving some ZFS-related bits in the FreeBSD boot
> chain. At the moment it seems that things work mostly fine, except for
> a case where the boot code needs to read gang blocks. We have some
> reports from users about failures, but unfortunately their pools are
> not available for testing anymore and I can not reproduce the issue at
> will.
>
> I am sure that the (Open)Solaris GRUB version has been properly tested,
> including the above environment.
>
> Could you please help me with ideas on how to create a
> pool/filesystem/file that would have gang blocks with high probability?
> Perhaps there are some pre-made test pool images available? Or some
> specialized tool?
>
> Thanks a lot!
> --
> Andriy Gapon
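A hedged sketch of applying the tunable (the variable name comes from Jeff's reply; the 8k value and the exact mechanics are illustrative). Either patch the running kernel:

	# echo 'metaslab_gang_bang/Z 0t8192' | mdb -kw

or add to /etc/system and reboot:

	set zfs:metaslab_gang_bang = 8192

Then write some large files: allocations above the threshold are forced to gang (in some builds only a fraction of them, to spread test coverage).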
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
> People used fastfs for years in specific environments (hopefully
> understanding the risks), and disabling the ZIL is safer than fastfs.
> Seems like it would be a useful ZFS dataset parameter.

We agree. There's an open RFE for this:

	6280630 zil synchronicity

No promise on date, but it will bubble to the top eventually.

Jeff
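For the record, this RFE was later delivered as the per-dataset 'sync' property; a hedged sketch with a hypothetical dataset name:

	# zfs set sync=disabled tank/scratch

This disables synchronous semantics (and thus ZIL use) for that one dataset, with all the same risks the fastfs comparison implies.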
Re: [zfs-discuss] compressratio vs. dedupratio
It is by design. The idea is to report the dedup ratio for the data you've actually attempted to dedup.

To get a 'diluted' dedup ratio of the sort you describe, just compare the space used by all datasets to the space allocated in the pool. For example, on my desktop, I have a pool called 'builds' with dedup enabled on some datasets:

	$ zfs get used builds
	NAME    PROPERTY  VALUE  SOURCE
	builds  used      81.0G  -

	$ zpool get allocated builds
	NAME    PROPERTY   VALUE  SOURCE
	builds  allocated  47.4G  -

Thus my diluted dedup ratio is 81.0 / 47.4 = 1.71.

Jeff

On Sat, Dec 12, 2009, Robert Milkowski wrote:

> Hi,
>
> The compressratio property seems to be a ratio of compression for a
> given dataset, calculated in such a way that all data in it (compressed
> or not) is taken into account. The dedupratio property, on the other
> hand, seems to take into account only dedupped data in a pool.
>
> So, for example, if there is already 1TB of data before dedup=on, and
> then dedup is set to on and 3 small identical files are copied in, the
> dedupratio will be 3. IMHO this is misleading, as it suggests that on
> average a ratio of 3 was achieved in the pool, which is not true.
>
> Is it by design or is it a bug? If it is by design, then having another
> property which would give a ratio of dedup in relation to all data in a
> pool (dedupped or not) would be useful.
>
> Example (snv 129):
>
> mi...@r600:/rpool/tmp# mkfile 200m file1
> mi...@r600:/rpool/tmp# zpool create -O atime=off test /rpool/tmp/file1
> mi...@r600:/rpool/tmp# ls -l /var/adm/messages
> -rw-r--r-- 1 root root 70993 2009-12-12 21:50 /var/adm/messages
> mi...@r600:/rpool/tmp# cp /var/adm/messages /test/
> mi...@r600:/rpool/tmp# sync
> mi...@r600:/rpool/tmp# zfs get compressratio test
> NAME  PROPERTY       VALUE  SOURCE
> test  compressratio  1.00x  -
> mi...@r600:/rpool/tmp# zfs set compression=gzip test
> mi...@r600:/rpool/tmp# cp /var/adm/messages /test/messages.1
> mi...@r600:/rpool/tmp# sync
> mi...@r600:/rpool/tmp# zfs get compressratio test
> NAME  PROPERTY       VALUE  SOURCE
> test  compressratio  1.27x  -
> mi...@r600:/rpool/tmp# zfs get compressratio test
> NAME  PROPERTY       VALUE  SOURCE
> test  compressratio  1.24x  -
> mi...@r600:/rpool/tmp# zpool destroy test
> mi...@r600:/rpool/tmp# zpool create -O atime=off test /rpool/tmp/file1
> mi...@r600:/rpool/tmp# zpool get dedupratio test
> NAME  PROPERTY    VALUE  SOURCE
> test  dedupratio  1.00x  -
> mi...@r600:/rpool/tmp# cp /var/adm/messages /test/
> mi...@r600:/rpool/tmp# sync
> mi...@r600:/rpool/tmp# zpool get dedupratio test
> NAME  PROPERTY    VALUE  SOURCE
> test  dedupratio  1.00x  -
> mi...@r600:/rpool/tmp# cp /var/adm/messages /test/messages.1
> mi...@r600:/rpool/tmp# sync
> mi...@r600:/rpool/tmp# zpool get dedupratio test
> NAME  PROPERTY    VALUE  SOURCE
> test  dedupratio  1.00x  -
> mi...@r600:/rpool/tmp# cp /var/adm/messages /test/messages.2
> mi...@r600:/rpool/tmp# sync
> mi...@r600:/rpool/tmp# zpool get dedupratio test
> NAME  PROPERTY    VALUE  SOURCE
> test  dedupratio  2.00x  -
> mi...@r600:/rpool/tmp# rm /test/messages
> mi...@r600:/rpool/tmp# sync
> mi...@r600:/rpool/tmp# zpool get dedupratio test
> NAME  PROPERTY    VALUE  SOURCE
> test  dedupratio  2.00x  -
>
> --
> Robert Milkowski
> http://milek.blogspot.com
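A hedged one-liner for the diluted ratio Jeff computes by hand above (pool name "builds" as in his example; assumes releases where both zfs get and zpool get accept -H and -p for parseable output -- current OpenZFS does, older bits may need the human-readable values trimmed by hand):

	used=$(zfs get -Hp -o value used builds)
	alloc=$(zpool get -Hp -o value allocated builds)
	echo "$used $alloc" | awk '{ printf "%.2f\n", $1 / $2 }'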
Re: [zfs-discuss] Doing ZFS rollback with preserving later created clones/snapshot?
Yes, although it's slightly indirect:

	- make a clone of the snapshot you want to roll back to
	- promote the clone

See 'zfs promote' for details.

Jeff

On Fri, Dec 11, 2009 at 08:37:04AM +0100, Alexander Skwar wrote:

> Hi.
>
> Is it possible on Solaris 10 5/09 to roll back to a ZFS snapshot
> WITHOUT destroying later created clones or snapshots?
>
> Example:
>
> --($ ~)-- sudo zfs snapshot rpool/r...@01
> --($ ~)-- sudo zfs snapshot rpool/r...@02
> --($ ~)-- sudo zfs clone rpool/r...@02 rpool/ROOT-02
> --($ ~)-- LC_ALL=C sudo zfs rollback rpool/r...@01
> cannot rollback to 'rpool/r...@01': more recent snapshots exist
> use '-r' to force deletion of the following snapshots:
> rpool/r...@02
>
> So it isn't as simple as that. But what needs to be done to preserve
> rpool/ROOT-02?
>
> Actually, I'm not concerned (that much) with preserving the clone
> rpool/ROOT-02. But I'd like to keep the contents of rpool/ROOT as it
> was when I created the @02 snapshot.
>
> Is the only possible way to create a backup of rpool/r...@02 (e.g. of
> the snapshot directory /rpool/ROOT/.zfs/snapshots/02) and then restore
> it later on (e.g. backup to tape, backup to some other filesystem using
> zfs send|recv, rsync, tar, ...)?
>
> Thanks a lot,
> Alexander
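A hedged sketch of the clone-and-promote sequence Jeff describes (snapshot and clone names are hypothetical, since the originals are elided above):

	# zfs clone rpool/ROOT@01 rpool/ROOT-01
	# zfs promote rpool/ROOT-01

After the promote, rpool/ROOT-01 becomes the origin filesystem and the @01 snapshot (plus anything older) migrates to it, so the later @02 snapshot and its clone survive intact.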
Re: [zfs-discuss] Deduplication - deleting the original
> i am no pro in zfs, but to my understanding there is no original.

That is correct. From a semantic perspective, there is no change in behavior between dedup=off and dedup=on. Even the accounting remains the same: each reference to a block is charged to the dataset making the reference. The only place you see the effect of dedup is at the pool level, which can now have more logical than physical data. You may also see a difference in performance, which can be either positive or negative depending on a whole bunch of factors.

At the implementation level, all that's really happening with dedup is that when you write a block whose contents are identical to an existing block, instead of allocating new disk space we just increment a reference count on the existing block. When you free the block (from the dataset's perspective), the storage pool decrements the reference count, but the block remains allocated at the pool level. When the reference count goes to zero, the storage pool frees the block for real (returns it to the storage pool's free space map).

But, to reiterate, none of this is visible semantically. The only way you can even tell dedup is happening is to observe that the total space used by all datasets exceeds the space allocated from the pool -- i.e. that the pool's dedup ratio is greater than 1.0.

Jeff
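That last observation maps directly onto two commands (pool name hypothetical):

	# zfs list -ro name,used tank
	# zpool get dedupratio tank

If the summed dataset 'used' exceeds the pool's allocated space, dedupratio reads above 1.00x; deleting "the original" of a deduped pair simply drops a refcount and changes none of this.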
Re: [zfs-discuss] heads-up: dedup=fletcher4,verify was broken
And, for the record, this is my fault. There is an aspect of endianness that I simply hadn't thought of. When I have a little more time I will blog about the whole thing, because there are many useful lessons here.

Thank you, Matt, for all your help with this. And my apologies to everyone else for the disruption.

Jeff

On Mon, Nov 23, 2009 at 09:15:48PM -0800, Matthew Ahrens wrote:

> We discovered another, more fundamental problem with
> dedup=fletcher4,verify. I've just putback the fix for:
>
>   6904243 zpool scrub/resilver doesn't work with cross-endian
>           dedup=fletcher4,verify blocks
>
> The same instructions as below apply, but in addition, the
> dedup=fletcher4,verify functionality has been removed. We will
> investigate whether it's possible to fix these issues and re-enable
> this functionality.
>
> --matt
>
> Matthew Ahrens wrote:
>
>> If you did not do "zfs set dedup=fletcher4,verify <fs>" (which is
>> available in build 128 and nightly bits since then), you can ignore
>> this message.
>>
>> We have changed the on-disk format of the pool when using
>> dedup=fletcher4,verify with the integration of:
>>
>>   6903705 dedup=fletcher4,verify doesn't byteswap correctly,
>>           has lots of hash collisions
>>
>> This is not the default dedup setting; pools that only used "zfs set
>> dedup=on" (or =sha256, or =verify, or =sha256,verify) are unaffected.
>>
>> Before installing bits with this fix, you will need to destroy any
>> filesystems that have had dedup=fletcher4,verify set on them. You can
>> preserve your existing data by running:
>>
>>   zfs set dedup=<any other setting> <old fs>
>>   zfs snapshot -r <old fs>@snap
>>   zfs create <new fs>
>>   zfs send -R <old fs>@snap | zfs recv -d <new fs>
>>   zfs destroy -r <old fs>
>>
>> Simply changing the setting from dedup=fletcher4,verify to another
>> setting is not sufficient, as this does not modify existing data.
>>
>> You can verify that your pool isn't using dedup=fletcher4,verify by
>> running:
>>
>>   zdb -D <pool> | grep DDT-fletcher4
>>
>> If there are no matches, your pool is not using dedup=fletcher4,verify,
>> and it is safe to install bits with this fix.
>>
>> Build 128 will be respun to include this fix.
>>
>> Sorry for the inconvenience,
>> -- team zfs
Re: [zfs-discuss] heads-up: dedup=fletcher4,verify was broken
Finally, just to be clear, one last point: the two fixes integrated today only affect you if you've explicitly set dedup=fletcher4,verify. To quote Matt: "This is not the default dedup setting; pools that only used zfs set dedup=on (or =sha256, or =verify, or =sha256,verify) are unaffected."

Jeff

On Mon, Nov 23, 2009 at 09:44:41PM -0800, Jeff Bonwick wrote:

> And, for the record, this is my fault. There is an aspect of endianness
> that I simply hadn't thought of. [full thread quoted in the previous
> message]
Re: [zfs-discuss] dedupe is in
> Terrific! Can't wait to read the man pages / blogs about how to use it...

Just posted one: http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup

Enjoy, and let me know if you have any questions or suggestions for follow-on posts.

Jeff
Re: [zfs-discuss] Apple cans ZFS project
> Apple can currently just take the ZFS CDDL code and incorporate it
> (like they did with DTrace), but it may be that they wanted a private
> license from Sun (with appropriate technical support and
> indemnification), and the two entities couldn't come to mutually
> agreeable terms.

I cannot disclose details, but that is the essence of it.

Jeff
Re: [zfs-discuss] Replacing a failed drive
Yep, you got it.

Jeff

On Fri, Jun 19, 2009 at 04:15:41PM -0700, Simon Breden wrote:

> Hi, I have a ZFS storage pool consisting of a single RAIDZ2 vdev of 6
> drives, and I have a question about replacing a failed drive, should it
> occur in future.
>
> If a drive fails in this double-parity vdev, am I correct in saying
> that I would need to (1) unplug the old drive once I've identified the
> drive id (c1t0d0 etc), (2) plug in the new drive on the same SATA
> cable, and (3) issue a 'zpool replace <pool_name> <drive_id>' command,
> at which point ZFS will resilver the new drive from the parity data?
>
> Thanks,
> Simon
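A hedged sketch of that sequence (pool and device names hypothetical):

	# zpool replace tank c1t0d0
	# zpool status tank

With a single device argument, 'zpool replace' assumes the new drive occupies the same device path as the failed one; pass a second device name if it landed elsewhere, and watch 'zpool status' for the resilver to finish.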
Re: [zfs-discuss] Mobo SATA migration to AOC-SAT2-MV8 SATA card
Yep, right again.

Jeff

On Fri, Jun 19, 2009 at 04:21:42PM -0700, Simon Breden wrote:

> Hi, I'm using 6 SATA ports from the motherboard but I've now run out of
> SATA ports, and so I'm thinking of adding a Supermicro AOC-SAT2-MV8
> 8-port SATA controller card. What is the procedure for migrating the
> drives to this card?
>
> Is it a simple case of (1) issuing a 'zpool export <pool_name>'
> command, (2) shutdown, (3) insert card and move all SATA cables for
> drives from mobo to card, (4) boot and issue a 'zpool import
> <pool_name>' command?
>
> Thanks,
> Simon
> http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/
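Spelled out, with a hypothetical pool name:

	# zpool export tank
	(power off, move the cables to the new controller, power on)
	# zpool import tank
	# zpool status tank

Because ZFS identifies disks by their on-disk labels rather than by controller path, the import finds them at their new c-t-d names automatically.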
Re: [zfs-discuss] Resilver Performance and Behavior
> According to the ZFS documentation, a resilver operation includes what
> is effectively a dirty region log (DRL) so that if the resilver is
> interrupted, by a snapshot or reboot, the resilver can continue where
> it left off.

That is not the case. The dirty region log keeps track of the time periods during which a device was offline, so that if a device goes offline and comes back soon thereafter, only the recent data needs to be resilvered. For that reason we call it the Dirty Time Log (DTL) rather than DRL. This is efficient because actual device outages are temporal, not spatial. As a rule, a 5-minute outage can be fully resilvered in 5 minutes or less.

Jeff
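You can watch the DTL at work with a brief, deliberate outage (pool and device names hypothetical; only do this on a redundant vdev):

	# zpool offline tank c2t3d0
	(wait a few minutes while writes continue)
	# zpool online tank c2t3d0
	# zpool status tank

The resilver that follows covers only the transaction groups from the offline window, not the whole disk.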
Re: [zfs-discuss] Peculiarities of COW over COW?
> ZFS blocksize is dynamic, power of 2, with a max size == recordsize.

Minor clarification: recordsize is restricted to powers of 2, but blocksize is not -- it can be any multiple of the sector size (512 bytes). For small files, this matters: a 37k file is stored in a 37k block. For larger, multi-block files, the size of each block is indeed a power of 2 (simplifies the math a bit).

Jeff
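A hedged way to see the small-file behavior (paths hypothetical; assumes compression is off on the dataset):

	$ dd if=/dev/urandom of=/tank/fs/f37k bs=1k count=37
	$ du -k /tank/fs/f37k

du should report roughly 37-38 KB rather than a full 128 KB record, because the file's single block is sized to fit the data.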
Re: [zfs-discuss] Data size grew.. with compression on
> Yes, I made note of that in my OP on this thread. But is it enough to
> end up with 8gb of non-compressed files measuring 8gb on reiserfs
> (linux) and the same data showing nearly 9gb when copied to a zfs
> filesystem with compression on? Whoops... a hefty exaggeration -- it
> only shows about a 16mb difference. But still, since the zfs side is
> compressed, that seems like quite a lot.

That's because ZFS reports *all* space consumed by a file, including all metadata (dnodes, indirect blocks, etc). For an 8G file stored in 128K blocks, there are 8G / 128K = 64K block pointers, each of which is 128 bytes and is two-way replicated (via ditto blocks), for a total of 64K * 128 * 2 = 16M. So this is exactly as expected.

Jeff
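The same arithmetic, done in the shell for concreteness:

	$ echo $(( (8 * 1024 * 1024 * 1024) / (128 * 1024) ))
	65536                                   # block pointers
	$ echo $(( 65536 * 128 * 2 / (1024 * 1024) ))
	16                                      # MB of two-way ditto'd pointers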
Re: [zfs-discuss] Data size grew.. with compression on
Right. Another difference to be aware of is that ZFS reports the total space consumed, including space for metadata -- typically around 1%. Traditional filesystems like ufs and ext2 preallocate metadata and don't count it as using space. I don't know how reiserfs does its bookkeeping, but I wouldn't be surprised if it followed that model.

Jeff

On Mon, Mar 30, 2009 at 02:57:31PM -0400, Brad Plecs wrote:

> I've run into this too... I believe the issue is that the block
> size/allocation unit size in ZFS is much larger than the default size
> on older filesystems (ufs, ext2, ext3). The result is that if you have
> lots of files smaller than the block size, they take up more total
> space on the filesystem because they each occupy at least the block
> size.
>
> See the 'recordsize' ZFS filesystem property, though re-reading the man
> pages, I'm not 100% sure that tuning this property will have the
> intended effect.
>
> BP
>
>> I rsynced an 11gb pile of data from a remote linux machine to a zfs
>> filesystem with compression turned on. The data appears to have grown
>> in size rather than been compressed. Many, even most, of the files are
>> formats that are already compressed, such as mpg, jpg, and avi. But
>> many text files (*.html) are in there too. So I didn't expect much
>> compression, but I also didn't expect the size to grow.
>>
>> I realize these are different filesystems that may report differently:
>> reiserfs on the linux machine and zfs on osol.
>>
>> In bytes:
>>   osol:  11542196307
>>   linux: 11525114469
>>   difference: 17081838
>>
>> Or (if I got the math right) about 16.29 MB bigger on the zfs side
>> with compression on.
Re: [zfs-discuss] RFE: creating multiple clones in one zfs(1) call and one txg
I agree with Chris -- I'd much rather do something like:

	zfs clone <snap1> <clone1> <snap2> <clone2> <snap3> <clone3> ...

than introduce a pattern grammar. Supporting multiple snap/clone pairs on the command line allows you to do just about anything atomically.

Jeff

On Fri, Mar 27, 2009 at 10:46:33AM -0500, Chris Kirby wrote:

> On Mar 27, 2009, at 10:33 AM, Darren J Moffat wrote:
>
>> a) that is probably what is wanted most of the time anyway
>> b) it is easy to pass from userland to kernel - you pass the rules
>>    (after some userland sanity checking first) as is.
>
> But doesn't that also exclude the possibility of creating non-pattern
> based clones in a single txg?
>
> While I think that allowing multiple clones to be created in a single
> txg is perfectly reasonable, we shouldn't need to artificially restrict
> the clone namespace in order to achieve that.
>
> -Chris
Re: [zfs-discuss] Forensics related ZFS questions
> 1. Does variable FSB block sizing extend to files larger than record
> size, concerning the last FSB allocated? In other words, for files
> larger than 128KB that utilize more than one full recordsize FSB, will
> the LAST FSB allocated be 'right-sized' to fit the remaining data, or
> will ZFS allocate a full recordsize FSB for the last 'chunk' of the
> file?

The last block is currently a multiple of the recordsize, but we intend to fix this. There are two options: one, to treat the last block as a special case; the other, to handle it automatically via compression. The former is a little more work, but has the advantage of reducing the file's in-memory footprint as well as its on-disk footprint.

> 2. Can a developer confirm that COW occurs at the FSB level (vs. sector
> level, for example)? In other words, when a single-FSB file (say a 64KB
> file with recordsize=128KB) is modified, and it's only one sector
> within that file that's modified, is it correct that what's
> copied-on-write is the entire 64KB FSB allocated to that file? (This is
> a data recovery issue.)

Yes, that's correct.

Jeff
Re: [zfs-discuss] ZFS: unreliable for professional usage?
> I'm rather tired of hearing this mantra. [...] Every file system needs
> a repair utility.

Hey, wait a minute -- that's a mantra too!

I don't think there's actually any substantive disagreement here -- stating that one doesn't need a separate program called /usr/sbin/fsck is not the same as saying that filesystems don't need error detection and recovery. There's quite a bit of that in the current code, and more in the works. Like performance, it is never really done -- we can always do better.

> I've described before a number of checks which ZFS could perform [...]

Well, ZFS is open source. I would love to see your passion for this topic directed at the source code. Seriously.

Jeff
Re: [zfs-discuss] ZFS: unreliable for professional usage?
> This is CR 6667683
> http://bugs.opensolaris.org/view_bug.do?bug_id=6667683
> I think that would solve 99% of ZFS corruption problems!

Based on the reports I've seen to date, I think you're right.

> Is there any ETA for this patch?

Well, because of this thread, this has gone from "on my list" to "I'm currently working on it". And I'd like to take a moment to thank everyone who's weighed in, because it really does make a difference in setting priorities. As for a date, I would estimate weeks, not months.

Jeff
Re: [zfs-discuss] Does your device honor write barriers?
> Well... if you want a write barrier, you can issue a flush-cache and
> wait for a reply before releasing writes behind the barrier. You will
> get what you want by doing this, for certain.

Not if the disk drive just *ignores* barrier and flush-cache commands and returns success. Some consumer drives really do exactly that. That's the issue that people are asking ZFS to work around.

But it's important to understand that this failure mode (silently ignoring SCSI commands) is truly a case of broken-by-design hardware. If a disk doesn't honor these commands, then no synchronous operation is ever truly synchronous -- it'd be like your OS just ignoring O_SYNC. This means you can't use such disks for (say) a database or NFS server, because it is *impossible* to know when the data is on stable storage.

If it were possible to detect such disks, I'd add code to ZFS that would simply refuse to use them. Unfortunately, there is no reliable way to test the functioning of synchronize-cache programmatically.

Jeff
Re: [zfs-discuss] ZFS: unreliable for professional usage?
>> There is no substitute for cord-yank tests - many and often. The weird
>> part is, the ZFS design team simulated millions of them.
>
> So the full explanation remains to be uncovered?

We simulated power failure; we did not simulate disks that simply blow off write ordering. Any disk that you'd ever deploy in an enterprise or storage appliance context gets this right.

The good news is that ZFS is getting popular enough on consumer-grade hardware. The bad news is that said hardware has a different set of failure modes, so it takes a bit of work to become resilient to them. This is pretty high on my short list.

Jeff
Re: [zfs-discuss] snapshot identity
> The Validated Execution project is investigating how to utilize ZFS
> snapshots as the basis of a validated filesystem. Given that the blocks
> of the dataset form a Merkle tree of hashes, it seemed straightforward
> to validate the individual objects in the snapshot and then sign the
> hash of the root as a means of indicating that the contents of the
> dataset were validated.

Yep, that would work.

> Unfortunately, the block hashes are used to assure the integrity of the
> physical representation of the dataset. Those hash values can be
> updated during scrub operations, or even during data error recovery,
> while the logical content of the dataset remains intact.

Actually, that's not true -- at least not today. Once you've taken a snapshot, the content will never change. Scrub, resilver, and self-heal operations repair damaged copies of data, but they don't alter the data itself, and therefore don't alter its checksum.

This will change when we add support for block rewrite, which will allow us to do things like migrate data from one device to another, or to recompress existing data, which *will* affect the checksum. You may be able to tolerate this by simply precluding it, if you're targeting a restricted environment. For example, do you need this feature for anything other than the root pool?

Jeff
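To inspect the root of that hash tree, one hedged pointer (pool name hypothetical): zdb's uberblock dump shows the active uberblock, whose root block pointer carries the checksum you would be signing:

	# zdb -u tank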
Re: [zfs-discuss] ZFS core contributor nominations
> I would like to nominate roch.bourbonn...@sun.com for his work on
> improving the performance of ZFS over the last few years.

Absolutely.

Jeff
Re: [zfs-discuss] Where does set the value to zio-io_offset?
Each ZFS block pointer contains up to three DVAs (data virtual addresses), to implement 'ditto blocks' (multiple copies of the data, above and beyond any replication provided by mirroring or RAID-Z). Semantically, ditto blocks are a lot like mirrors, so we actually use the mirror code to read them. We do this even in the degenerate single-copy case because it makes a bunch of other simplifications possible.

Each DVA contains a vdev and offset, which are extracted by DVA_GET_VDEV() and DVA_GET_OFFSET() for each DVA in vdev_mirror_map_alloc(), and stored in the mirror map's mc_vd and mc_offset fields. We then pass these values to zio_vdev_child_io(), which zio_create()s a dependent child zio to read or write the data.

Jeff

On Fri, Jan 23, 2009 at 10:53:35PM -0800, Jin wrote:

> Assume we start one disk write action; vdev_disk_io_start() will be
> called from zio_execute():
>
>   static int
>   vdev_disk_io_start(zio_t *zio)
>   {
>           ...
>           bp->b_lblkno = lbtodb(zio->io_offset);
>           ...
>   }
>
> After scanning the zfs source, I find that zio->io_offset is only
> assigned in zio_create(), via the offset parameter -- and zio_write()
> calls zio_create() with the value 0 for that parameter. I can't find
> anywhere else that zio->io_offset is set.
>
> After a new block is born, the correct offset has been filled into
> bp->blk_dva (see metaslab_alloc()). When and where is the correct value
> set in zio->io_offset? Who can tell me? Thanks.
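To see those DVAs on disk, a hedged zdb sketch (dataset name and object number hypothetical):

	# zdb -ddddd tank/fs 8

At this verbosity, zdb prints each block pointer with its DVAs as vdev:offset:asize triples (e.g. 0:4000:200) -- the same vdev/offset pairs the mirror map carries in mc_vd and mc_offset.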
Re: [zfs-discuss] Split responsibility for data with ZFS
> Off the top of my head, nearly all of them. Some of them have
> artificial limitations because they learned the hard way that if you
> give customers enough rope they'll hang themselves. For instance,
> unlimited snapshots.

Oh, that's precious! It's not an arbitrary limit, it's a safety feature!

> Outside of that... I don't see ANYTHING in your list they didn't do
> first.

Then you don't know ANYTHING about either platform. Constant-time snapshots, for example. ZFS has them; NetApp's are O(N), where N is the total number of blocks, because that's how big their bitmaps are. If you think O(1) is not a revolutionary improvement over O(N), then not only do you not know much about either snapshot algorithm, you don't know much about computing.

Sorry, everyone else, for feeding the troll. Chum the water all you like, I'm done with this thread.

Jeff
Re: [zfs-discuss] zpool mirror creation after non-mirrored zpool is setup
On Sat, Dec 13, 2008 at 04:44:10PM -0800, Mark Dornfeld wrote:

> I have installed Solaris 10 on a ZFS filesystem that is not mirrored.
> Since I have an identical disk in the machine, I'd like to add that
> disk to the existing pool as a mirror. Can this be done, and if so,
> how do I do it?

Yes:

	# zpool attach <poolname> <old_disk> <new_disk>

Jeff
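A concrete hedged example (device names hypothetical, matching the root-pool case in the question):

	# zpool attach rpool c0t0d0s0 c0t1d0s0
	# zpool status rpool

Wait for the resilver to complete. For a Solaris 10 root pool you would also want to install boot blocks on the new half of the mirror (e.g. with installgrub on x86) so the machine can boot from either disk; that step isn't covered in the reply above.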
Re: [zfs-discuss] Split responsibility for data with ZFS
> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution. 99% of what zfs is attempting to do is something NetApp and
> WAFL have been doing for 15 years+. Regardless of the merits of their
> patents and prior art, etc., this is not something revolutionarily new.
> It may be revolution in the sense that it's the first time it's come to
> open source software and been given away, but it's hardly revolutionary
> in file systems as a whole.

99% of what ZFS is attempting to do? Hmm, OK -- let's make a list:

	end-to-end checksums
	unlimited snapshots and clones
	O(1) snapshot creation
	O(delta) snapshot deletion
	O(delta) incremental generation
	transactionally safe RAID without NVRAM
	variable blocksize
	block-level compression
	dynamic striping
	intelligent prefetch with automatic length and stride detection
	ditto blocks to increase metadata replication
	delegated administration
	scalability to many cores
	scalability to huge datasets
	hybrid storage pools (flash/disk mix) that optimize price/performance

How many of those does NetApp have? I believe the correct answer is 0%.

Jeff
Re: [zfs-discuss] Slow death-spiral with zfs gzip-9 compression
> If you have more comments, or especially if you think I reached the
> wrong conclusion, please do post it. I will post my continuing results.

I think your conclusions are correct. The main thing you're seeing is the combination of gzip-9 being incredibly CPU-intensive with our I/O pipeline allowing too much of it to be scheduled in parallel. The latter is a bug we will fix; the former is the nature of the gzip algorithm.

One other thing you may encounter from time to time is slowdowns due to kernel VA fragmentation. The CPU you're using is 32-bit, so you're running a 32-bit kernel, which has very little KVA. This tends to be more of a problem with big-memory machines, however -- e.g. a system with 8GB running a 32-bit kernel. With 768MB, you'll probably be OK, but it's something to be aware of on any 32-bit system. You can tell if this is affecting you by looking for kernel threads stuck waiting to allocate a virtual address:

	# echo '::walk thread | ::findstack -v' | mdb -k | grep vmem_xalloc

Jeff
Re: [zfs-discuss] ZFS, Smashing Baby a fake???
I think we (the ZFS team) all generally agree with you. The current nevada code is much better at handling device failures than it was just a few months ago. And there are additional changes that were made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000) product line that will make things even better once the FishWorks team has a chance to catch its breath and integrate those changes into nevada. And then we've got further improvements in the pipeline.

The reason this is all so much harder than it sounds is that we're trying to provide increasingly optimal behavior given a collection of devices whose failure modes are largely ill-defined. (Is the disk dead or just slow? Gone or just temporarily disconnected? Does this burst of bad sectors indicate catastrophic failure, or just localized media errors?) The disks' SMART data is notoriously unreliable, BTW. So there's a lot of work underway to model the physical topology of the hardware, gather telemetry from the devices, the enclosures, the environmental sensors etc, so that we can generate an accurate FMA fault diagnosis and then tell ZFS to take appropriate action. We have some of this today; it's just a lot of work to complete it.

Oh, and regarding the original post -- as several readers correctly surmised, we weren't faking anything, we just didn't want to wait for all the device timeouts. Because the disks were on USB, which is a hotplug-capable bus, unplugging the dead disk generated an interrupt that bypassed the timeout. We could have waited it out, but 60 seconds is an eternity on stage.

Jeff

On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:

> But that's exactly the problem, Richard: AFAIK. Can you state
> absolutely, categorically, that there is no failure mode out there
> (caused by hardware faults or bad drivers) that won't lock a drive up
> for hours? You can't, obviously, which is why we keep saying that ZFS
> should have this kind of timeout feature. For once I agree with Miles;
> I think he's written a really good writeup of the problem here.
>
> My simple view on it would be this: drives are only aware of themselves
> as an individual entity. Their job is to save and restore data to
> themselves, and drivers are written to minimise any chance of data
> loss. So when a drive starts to fail, it makes complete sense for the
> driver and hardware to be very, very thorough about trying to read or
> write that data, and to fail only as a last resort. I'm not at all
> surprised that drives take 30 seconds to time out, nor that they could
> slow a pool for hours. That's their job. They know nothing else about
> the storage; they just have to do their level best to do as they're
> told, and will only fail if they absolutely can't store the data.
>
> The raid controller, on the other hand (NetApp / ZFS, etc.), knows all
> about the pool. It knows if you have half a dozen good drives online,
> it knows if there are hot spares available, and it *should* also know
> how quickly the drives under its care usually respond to requests. ZFS
> is perfectly placed to spot when a drive is starting to fail, and to
> take the appropriate action to safeguard your data. It has far more
> information available than a single drive ever will, and should be
> designed accordingly. Expecting the firmware and drivers of individual
> drives to control the failure modes of your redundant pool is just
> crazy imo. You're throwing away some of the biggest benefits of using
> multiple drives in the first place.
Re: [zfs-discuss] Lost Disk Space
Are you running this on a live pool? If so, zdb can't get a reliable block count -- and 'zdb -L <live pool>' emits a warning to that effect.

Jeff

On Thu, Oct 16, 2008 at 03:36:25AM -0700, Ben Rockwood wrote:

> I've been struggling to fully understand why disk space seems to
> vanish. I've dug through bits of code and reviewed all the mails on the
> subject that I can find, but I still don't have a proper understanding
> of what's going on.
>
> I did a test with a local zpool on snv_97... zfs list, zpool list, and
> zdb all seem to disagree on how much space is available. In this case
> it's only a discrepancy of about 20G or so, but I've got Thumpers that
> have a discrepancy of over 6TB! Can someone give a really detailed
> explanation of what's going on?
>
> block traversal size 670225837056 != alloc 720394438144 (leaked 50168601088)
>
>         bp count:          15182232
>         bp logical:    672332631040  avg: 44284
>         bp physical:   669020836352  avg: 44066  compression: 1.00
>         bp allocated:  670225837056  avg: 44145  compression: 1.00
>         SPA allocated: 720394438144  used: 96.40%
>
> Blocks  LSIZE  PSIZE  ASIZE    avg   comp  %Total  Type
>     12   120K  26.5K  79.5K  6.62K   4.53    0.00  deferred free
>      1    512    512  1.50K  1.50K   1.00    0.00  object directory
>      3  1.50K  1.50K  4.50K  1.50K   1.00    0.00  object array
>      1    16K  1.50K  4.50K  4.50K  10.67    0.00  packed nvlist
>      -      -      -      -      -      -       -  packed nvlist size
>     72  8.45M   889K  2.60M  37.0K   9.74    0.00  bplist
>      -      -      -      -      -      -       -  bplist header
>      -      -      -      -      -      -       -  SPA space map header
>    974  4.48M  2.65M  7.94M  8.34K   1.70    0.00  SPA space map
>      -      -      -      -      -      -       -  ZIL intent log
>  96.7K  1.51G   389M   777M  8.04K   3.98    0.12  DMU dnode
>     17  17.0K  8.50K  17.5K  1.03K   2.00    0.00  DMU objset
>      -      -      -      -      -      -       -  DSL directory
>     13  6.50K  6.50K  19.5K  1.50K   1.00    0.00  DSL directory child map
>     12  6.00K  6.00K  18.0K  1.50K   1.00    0.00  DSL dataset snap map
>     14  38.0K  10.0K  30.0K  2.14K   3.80    0.00  DSL props
>      -      -      -      -      -      -       -  DSL dataset
>      -      -      -      -      -      -       -  ZFS znode
>      2     1K     1K     2K     1K   1.00    0.00  ZFS V0 ACL
>  5.81M   558G   557G   557G  95.8K   1.00   89.27  ZFS plain file
>   382K   301M   200M   401M  1.05K   1.50    0.06  ZFS directory
>      9  4.50K  4.50K  9.00K     1K   1.00    0.00  ZFS master node
>     12   482K  20.0K  40.0K  3.33K  24.10    0.00  ZFS delete queue
>  8.20M  66.1G  65.4G  65.8G  8.03K   1.01   10.54  zvol object
>      1    512    512     1K     1K   1.00    0.00  zvol prop
>      -      -      -      -      -      -       -  other uint8[]
>      -      -      -      -      -      -       -  other uint64[]
>      -      -      -      -      -      -       -  other ZAP
>      -      -      -      -      -      -       -  persistent error log
>      1   128K  10.5K  31.5K  31.5K  12.19    0.00  SPA history
>      -      -      -      -      -      -       -  SPA history offsets
>      -      -      -      -      -      -       -  Pool properties
>      -      -      -      -      -      -       -  DSL permissions
>      -      -      -      -      -      -       -  ZFS ACL
>      -      -      -      -      -      -       -  ZFS SYSACL
>      -      -      -      -      -      -       -  FUID table
>      -      -      -      -      -      -       -  FUID table size
>      5  3.00K  2.50K  7.50K  1.50K   1.20    0.00  DSL dataset next clones
>      -      -      -      -      -      -       -  scrub work queue
>  14.5M   626G   623G   624G  43.1K   1.00  100.00  Total
>
> real    21m16.862s
> user    0m36.984s
> sys     0m5.757s
>
> Looking at the data:
>
> [EMAIL PROTECTED] ~$ zfs list backup
> NAME    USED  AVAIL  REFER  MOUNTPOINT
> backup  685G   237K    27K  /backup
>
> [EMAIL PROTECTED] ~$ zpool list backup
> NAME    SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
> backup  696G   671G  25.1G   96%  ONLINE  -
>
> So zdb says 626GB is used, zfs list says 685GB is used, and zpool list
> says 671GB is used. The pool was filled to 100% capacity via dd; this
> is confirmed (I can't write data), but yet zpool list says it's only
> 96%.
>
> benr.
Re: [zfs-discuss] questions about replacing a raidz2 vdev disk with a larger one
ZFS will allow the replacement. The available size, however, is determined by the smallest disk of the lot. Once you've replaced *all* 500GB disks with 1TB disks, the available space will double.

One suggestion: replace as many disks as you intend to at the same time, so that ZFS only has to do one resilver operation. It's faster that way.

Jeff
Re: [zfs-discuss] questions about replacing a raidz2 vdev disk with a larger one
Actually, you can replace them all at once, as long as you don't unplug the old ones first. Let's say you have a raidz2 setup like this:

	mypool
	  raidz2
	    a
	    b
	    c
	    d

and you say this:

	# zpool replace mypool a A
	# zpool replace mypool b B
	# zpool replace mypool c C
	# zpool replace mypool d D

Your pool configuration will then become:

	mypool
	  raidz2
	    replacing
	      a
	      A
	    replacing
	      b
	      B
	    replacing
	      c
	      C
	    replacing
	      d
	      D

The original drives (a, b, c, d) will remain in the pool until the new drives (A, B, C, D) have all the data, at which point the old drives will be detached and the final pool configuration will be:

	mypool
	  raidz2
	    A
	    B
	    C
	    D

This assumes, of course, that you have enough slots to plug them all in. If you're slot-limited -- i.e. you can't add a new drive without pulling an old one -- then Eric is right, and in fact I'd go further: in that case, replace only one at a time, so you maintain the ability to survive a disk failing while you're doing all this.

Jeff

On Sat, Oct 11, 2008 at 06:37:17PM -0700, Erik Trimble wrote:

> Jeff Bonwick wrote:
>
>> One suggestion: replace as many disks as you intend to at the same
>> time, so that ZFS only has to do one resilver operation. It's faster
>> that way.
>>
>> Jeff
>
> Just to be more clear on this: assuming you have data you care about on
> the current raidz2 zpool, you should replace UP TO [2] drives at once.
> That way, you minimize re-silver times while keeping all your data
> intact.
>
> If you replace more than 2 at once, you'll destroy the array's
> redundancy and have to restore the data from backup. If you replace one
> at a time, you'll have to wait for each to resilver before replacing
> any more.
>
> If you don't care about the data, then just destroy the zpool, replace
> the drives, and recreate the zpool from scratch. It's faster and easier
> than waiting for the resilvers.
>
> --
> Erik Trimble
> Java System Support
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> The circumstances where I have lost data have been when ZFS has not
> handled a layer of redundancy. However, I am not terribly optimistic
> about the prospects of ZFS on any device that hasn't committed writes
> that ZFS thinks are committed.

FYI, I'm working on a workaround for broken devices. As you note, some disks flat-out lie: you issue the synchronize-cache command, they say "got it, boss", yet the data is still not on stable storage. Why do they do this? Because it performs better. Well, duh -- you can make stuff *really* fast if it doesn't have to be correct.

Before I explain how ZFS can fix this, I need to get something off my chest: people who knowingly make such disks should be in federal prison. It is *fraud* to win benchmarks this way. Doing so causes real harm to real people. Same goes for NFS implementations that ignore sync. We have specifications for a reason. People assume that you honor them, and build higher-level systems on top of them. Change the mass of the proton by a few percent, and the stars explode. It is impossible to build a functioning civil society in a culture that tolerates lies. We need a little more Code of Hammurabi in the storage industry.

Now: the uberblock ring buffer in ZFS gives us a way to cope with this, as long as we don't reuse freed blocks for a few transaction groups. The basic idea: if we can't read the pool starting from the most recent uberblock, then we should be able to use the one before it, or the one before that, etc., as long as we haven't yet reused any blocks that were freed in those earlier txgs. This allows us to use the normal load on the pool, plus the passage of time, as a displacement flush for disk caches that ignore the sync command. If we go back far enough in (txg) time, we will eventually find an uberblock all of whose dependent data blocks have made it to disk. I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough.

Jeff
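This txg-rewind idea later shipped as pool recovery; a hedged sketch of the eventual interface (pool name hypothetical):

	# zpool import -F tank

The -F flag asks the import to discard the last few transactions and fall back to the most recent uberblock from which the pool is readable -- exactly the walk-backwards strategy described above.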
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> Or is there a way to mitigate a checksum error on a non-redundant
> zpool?

It's just like the difference between non-parity, parity, and ECC memory. Most filesystems don't have checksums (non-parity), so they don't even know when they're returning corrupt data. ZFS without any replication can detect errors, but can't fix them (like parity memory). ZFS with mirroring or RAID-Z can both detect and correct (like ECC memory).

Note: even in a single-device pool, ZFS metadata is replicated via ditto blocks at two or three different places on the device, so that a localized media failure can be both detected and corrected. If you have two or more devices, even without any mirroring or RAID-Z, ZFS metadata is mirrored (again via ditto blocks) across those devices.

Jeff
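User data can get the same ditto-block treatment via the 'copies' property; a hedged sketch with a hypothetical dataset:

	# zfs set copies=2 tank/important

This stores two copies of every data block at different places on the device(s), so even a single-device pool can self-heal localized media errors in that dataset's data, not just its metadata.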
Re: [zfs-discuss] zpool file corruption
It's almost certainly the SIL3114 controller. Google "SIL3114 data corruption" -- it's nasty.

Jeff

On Thu, Sep 25, 2008 at 07:50:01AM +0200, Mikael Karlsson wrote:

> I have a strange problem involving changes in large files on a mirrored
> zpool in OpenSolaris snv96. We use it as storage in a VMware ESXi lab
> environment. All virtual disk files get corrupted when changes are made
> within the files (when running the machine, that is).
>
> The sad thing is that I've created about ~200GB of random data in large
> files and even modified those files without any problem (using dd with
> skip and conv=notrunc options). I've copied the files within the pool
> and over the network on all network interfaces on the machine, without
> problems. It's just those .vmdk files that get corrupted.
>
> The hardware is an Opteron desktop machine with a SIL3114 SATA
> interface. Personally, I have exactly the same interface at home with
> the same setup, without problems; only the other hardware differs
> (disks and so on). The disks are WD7500AACS, the ones with variable
> rotation speed (5400-7200). Could it be the disks? Could it be the disk
> controller, or the rest of the hardware? I should mention that the
> controller has been flashed with a non-raid BIOS.
>
> I could provide more information if needed! Does anyone have any ideas
> or suggestions?
>
> Some output:
>
> bash-3.00# zpool status -vx
>   pool: testing
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: scrub completed with 1 errors on Wed Sep 24 16:59:13 2008
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         testing     ONLINE       0     0    16
>           mirror    ONLINE       0     0    16
>             c0d1    ONLINE       0     0    51
>             c1d1    ONLINE       0     0    54
>
> errors: Permanent errors have been detected in the following files:
>
>         /testing/ZFS-problem/ZFS-problem-flat.vmdk
>
> Regards
> Mikael
Re: [zfs-discuss] Remove log device?
You are correct, and it is indeed annoying. I hope to have this fixed by the end of the month.

Jeff

On Sun, Jul 13, 2008 at 10:16:55PM -0500, Mike Gerdts wrote:

> It seems as though there is no way to remove a log device once it is
> added. Is this correct?
>
> Assuming this is correct, is there any reason that adding the ability
> to remove the log device would be particularly tricky?
>
> --
> Mike Gerdts
> http://mgerdts.blogspot.com/
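Log device removal did eventually ship; a hedged sketch of the interface that arrived later (pool and device names hypothetical):

	# zpool remove tank c3t0d0

where c3t0d0 is the slog device shown under "logs" in zpool status.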
Re: [zfs-discuss] scrub never finishes
ZFS co-inventor Matt Ahrens recently fixed this:

	6343667 scrub/resilver has to start over when a snapshot is taken

Trust me when I tell you that solving this correctly was much harder than you might expect. Thanks again, Matt.

Jeff

On Sun, Jul 13, 2008 at 07:08:48PM -0700, Anil Jangity wrote:

> Oh, my hunch was right. Yup, I do have an hourly snapshot going. I'll
> take it out and see. Thanks!
>
> Bob Friesenhahn wrote:
>
>> On Sun, 13 Jul 2008, Anil Jangity wrote:
>>
>>> On one of the pools, I started a scrub. It never finishes. At one
>>> time, I saw it go up to like 70%, and then a little bit later I ran
>>> the pool status and it went back to 5% and started again. What is
>>> going on? Here is the pool layout:
>>
>> Initiating a snapshot stops the scrub. I don't know if the scrub is
>> restarted at 0% or simply aborted. Are you taking snapshots during
>> the scrub?
>>
>> Bob
Re: [zfs-discuss] scrub failing to initialise
If the cabling outage was transient, the disk driver would simply retry until the disks came back. If it's a hotplug-capable bus and the disks were flagged as missing, ZFS would by default wait until the disks came back (see 'zpool get failmode <pool>'), and complete the I/O then. There would be no missing disk writes, hence nothing to resilver.

Jeff

On Mon, Jul 07, 2008 at 06:55:02PM +0200, Justin Vassallo wrote:

> Hi,
>
> I've got a zpool made up of 2 mirrored vdevs. For one moment I had a
> cabling problem and lost all disks... I reconnected and onlined the
> disks. No resilvering kicked in, so I tried to force a scrub, but
> nothing's happening. I issue the command and it's as if I never did.
>
> Any suggestions?
>
> Thanks
> justin
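The property Jeff references, sketched with a hypothetical pool:

	# zpool get failmode tank
	# zpool set failmode=wait tank

failmode accepts wait (block I/O until the devices return; the default), continue (return EIO for new writes), or panic.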
Re: [zfs-discuss] is it possible to add a mirror device later?
I would just swap the physical locations of the drives, so that the second half of the mirror is in the right location to be bootable. ZFS won't mind -- it tracks the disks by content, not by pathname. Note that SATA is not hotplug-happy, so you're probably best off doing this while the box is powered off. Upon reboot, ZFS should figure out what happened, update the device paths, and... that's it.

Jeff

On Sun, Jul 06, 2008 at 08:47:25AM +0200, Tommaso Boccali wrote:

>> As Edna and Robert mentioned, zpool attach will add the mirror. But
>> note that the X4500 has only two possible boot devices: c5t0d0 and
>> c5t4d0. This is a BIOS limitation. So you will want to mirror with
>> c5t4d0 and configure the disks for boot. See the docs on ZFS boot for
>> details on how to configure the boot sectors and grub.
>> -- richard
>
> uhm, bad. I did not know this, so now the root is:
>
> bash-3.2# zpool status rpool
>   pool: rpool
>  state: ONLINE
>  scrub: resilver completed after 0h8m with 0 errors on
>         Wed Jul  2 16:09:14 2008
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         rpool         ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             c5t0d0s0  ONLINE       0     0     0
>             c1t7d0    ONLINE       0     0     0
>         spares
>           c0t7d0      AVAIL
>           c1t6d0      AVAIL
>
> while c5t4d0 belongs to a raidz pool:
>
> ...
>           raidz1      ONLINE       0     0     0
>             c0t4d0    ONLINE       0     0     0
>             c1t4d0    ONLINE       0     0     0
>             c5t4d0    ONLINE       0     0     0
>             c6t7d0    ONLINE       0     0     0
>             c5t5d0    ONLINE       0     0     0
>             c5t6d0    ONLINE       0     0     0
>             c5t7d0    ONLINE       0     0     0
>             c1t5d0    ONLINE       0     0     0
> ...
>
> Is it possible to restore the good behavior? Something like:
>
>   - detach c1t7d0 from rpool
>   - detach c5t4d0 from the other pool (the pool still survives, since
>     it is raidz)
>   - reattach in reverse order (and so re-form the mirror and raidz)?
>
> Thanks a lot again,
> tommaso
>
> --
> Tommaso Boccali
> INFN Pisa
Re: [zfs-discuss] confusion and frustration with zpool
As a first step, 'fmdump -ev' should indicate why it's complaining about the mirror.

Jeff

On Sun, Jul 06, 2008 at 07:55:22AM -0700, Pete Hartman wrote:

> I'm doing another scrub after clearing "insufficient replicas", only to
> find that I'm back to the report of insufficient replicas, which
> basically leads me to expect this scrub (due to complete in about 5
> hours from now) won't have any benefit either.
>
> -bash-3.2# zpool status local
>   pool: local
>  state: FAULTED
>  scrub: scrub in progress for 0h32m, 9.51% done, 5h11m to go
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         local         FAULTED      0     0     0  insufficient replicas
>           mirror      ONLINE       0     0     0
>             c6d1p0    ONLINE       0     0     0
>             c0t0d0s3  ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             c6d0p0    ONLINE       0     0     0
>             c0t0d0s4  ONLINE       0     0     0
>           mirror      UNAVAIL      0     0     0  corrupted data
>             c8t0d0p0  ONLINE       0     0     0
>             c0t0d0s5  ONLINE       0     0     0
>
> errors: No known data errors
Re: [zfs-discuss] bug id 6343667
FYI, we are literally just days from having this fixed.

Matt: after putback you really should blog about this one -- both to let people know that this long-standing bug has been fixed, and to describe your approach to it. It's a surprisingly tricky and interesting problem.

Jeff

On Sat, Jul 05, 2008 at 01:20:11PM -0700, Ross wrote:

> If it ever does get released I'd love to hear about it. That bug, and
> the fact it appears to have been outstanding for three years, was one
> of the major reasons behind us not purchasing a bunch of x4500's.
Re: [zfs-discuss] Changing GUID
> How difficult would it be to write some code to change the GUID of a
> pool?

As a recreational hack, not hard at all. But I cannot recommend it in good conscience, because if the pool contains more than one disk, the GUID change cannot possibly be atomic. If you were to crash or lose power in the middle of the operation, your data would be gone.

What problem are you trying to solve?

Jeff
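For what it's worth, later ZFS releases added a supported way to do this; a hedged sketch (pool name hypothetical):

	# zpool reguid tank

which assigns the pool a new random GUID -- useful when a pool was cloned at the storage layer and both copies must be importable.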
Re: [zfs-discuss] [caiman-discuss] swap dump on ZFS volume
> To be honest, it is not quite clear to me how we might utilize
> dumpadm(1M) to help us calculate/recommend the size of the dump device.
> Could you please elaborate more on this?

dumpadm(1M) -c specifies the dump content, which can be kernel, kernel plus current process, or all memory. If the dump content is 'all', the dump space needs to be as large as physical memory. If it's just 'kernel', it can be some fraction of that.

Jeff
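A hedged illustration of the interplay:

	# dumpadm -c all        dump device must be sized near physical memory
	# dumpadm -c kernel     a fraction of physical memory usually suffices

so an installer could pick the dump zvol's size from the configured content type rather than from a fixed rule of thumb.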
Re: [zfs-discuss] [caiman-discuss] swap dump on ZFS volume
The problem is that size-capping is the only control we have over thrashing right now. It's not just thrashing, it's also any application that leaks memory. Without a cap, the broken application would continue plowing through memory until it had consumed every free block in the storage pool. What we really want is dynamic allocation with lower and upper bounds to ensure that there's always enough swap space, and that a reasonable upper limit isn't exceeded. As fortune would have it, that's exactly what we get with quotas and reservations on zvol-based swap today. If you prefer uncapped behavior, no problem -- unset the reservation and grow the swap zvol to 16EB. (Ultimately it would be cleaner to express this more directly, rather than via the nominal size of an emulated volume. The VM 2.0 project will address that, along with many other long-standing annoyances.) Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
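Concretely, a sketch of the uncapped variant, assuming the stock rpool/swap zvol (older builds use 'reservation' rather than 'refreservation', and 8E here just stands in for 'effectively unlimited'):

# zfs set refreservation=none rpool/swap
# zfs set volsize=8E rpool/swap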
Re: [zfs-discuss] [caiman-discuss] swap dump on ZFS volume
Neither swap or dump are mandatory for running Solaris. Dump is mandatory in the sense that losing crash dumps is criminal. Swap is more complex. It's certainly not mandatory. Not so long ago, swap was typically larger than physical memory. But in recent years, we've essentially moved to a world in which paging is considered a bug. Swap devices are often only a fraction of physical memory size now, which raises the question of why we even bother. On my desktop, which has 16GB of memory, the default OpenSolaris swap partition is 2GB. That's just stupid. Unless swap space significantly expands the amount of addressable virtual memory, there's no reason to have it. There have been a number of good suggestions here: (1) The right way to size the dump device is to let dumpadm(1M) do it based on the dump content type. (2) In a virtualized environment, a better way to get a crash dump would be to snapshot the VM. This would require a little bit of host/guest cooperation, in that the installer (or dumpadm) would have to know that it's operating in a VM, and the kernel would need some way to notify the VM that it just panicked. Both of these ought to be doable. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool with RAID-5 from intelligent storage arrays
Using ZFS to mirror two hardware RAID-5 LUNs is actually quite nice. Because the data is mirrored at the ZFS level, you get all the benefits of self-healing. Moreover, you can survive a great variety of hardware failures: three or more disks can die (one in the first array, two or more in the second), failure of a cable, or failure of an entire array. Jeff On Sat, Jun 14, 2008 at 08:09:49AM -0700, zfsmonk wrote: Mentioned on http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide is the following: ZFS works well with storage based protected LUNs (RAID-5 or mirrored LUNs from intelligent storage arrays). However, ZFS cannot heal corrupted blocks that are detected by ZFS checksums. based upon that, if we have LUNs already in RAID5 being served from intelligent storage arrays, is it any benefit to create the zpool in a mirror if zfs can't heal any corrupted blocks? Or would we just be wasting disk space? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
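For example, with one LUN from each array (device names hypothetical):

# zpool create tank mirror c2t0d0 c3t0d0

Every block is then checksummed by ZFS, and on a mismatch ZFS reads the other side of the mirror and rewrites the bad copy.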
Re: [zfs-discuss] ZFS Deferred Frees
When a block is freed as part of transaction group N, it can be reused in transaction group N+1. There's at most a one-txg (few-second) delay. Jeff On Mon, Jun 16, 2008 at 01:02:53PM -0400, Torrey McMahon wrote: I'm doing some simple testing of ZFS block reuse and was wondering when deferred frees kick in. Is it on some sort of timer to ensure data consistency? Does an other routine call it? Would something as simple as sync(1M) get the free block list written out so future allocations could use the space? ... or am I way off in the weeds? :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs mirror broken?
If you say 'zpool online pool disk' that should tell ZFS that the disk is healthy again and automatically kick off a resilver. Of course, that should have happened automatically. What version of ZFS / Solaris are you running? Jeff On Fri, Jun 20, 2008 at 06:01:25PM +0200, Justin Vassallo wrote: Hi, I have a zpool made of 2 vdev mirrors, with disks connected via USB hub. While one vdev was resilvering at 22% (HD replacement), the original disk went away (seems the USB hub is the culprit). I turned the disk off and back on. The status of the disk came back to ONLINE, but there is no resilvering happening. Disks are cool and idle. An clues what could be happening here? Should i plug out / in the new disk again? I can't check what status the data is in, because it was being used by a non-global zone which is failing to start, but that's another porblem: # zoneadm -z ZONE boot could not verify fs /data: could not access /tank/data: No such file or directory zoneadm: zone ZONE failed to verify justin ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
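For example, using the pool name from the zone error above and a hypothetical device name:

# zpool online tank c5t0d0
# zpool status tank

zpool status should then show the resilver in progress.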
Re: [zfs-discuss] Cannot delete errored file
That's odd -- the only way the 'rm' should fail is if it can't read the znode for that file. The znode is metadata, and is therefore stored in two distinct places using ditto blocks. So even if you had one unlucky copy that was damaged on two of your disks, you should still have another copy elsewhere. Assuming you weren't so shockingly unlucky, the only way to get a corrupted znode that I know of is flaky memory, such that the znode is checksummed, then the DRAM flips a bit, then you write the znode to disk. The fact that you've seen so many checksum errors makes me suspect hardware all the more. Can you send me the output of fmdump -ev and fmdump -eV ? There should be some useful crumbs in there... Jeff On Tue, Jun 03, 2008 at 04:27:21AM -0700, Ben Middleton wrote: Hi, I can't seem to delete a file in my zpool that has permanent errors: zpool status -vx pool: rpool state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://www.sun.com/msg/ZFS-8000-8A scrub: scrub completed after 2h10m with 1 errors on Tue Jun 3 11:36:49 2008 config: NAMESTATE READ WRITE CKSUM rpool ONLINE 0 0 0 raidz1ONLINE 0 0 0 c0t0d0 ONLINE 0 0 0 c0t1d0 ONLINE 0 0 0 c0t2d0 ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /export/duke/test/Acoustic/3466/88832/09 - Check.mp3 rm /export/duke/test/Acoustic/3466/88832/09 - Check.mp3 rm: cannot remove `/export/duke/test/Acoustic/3466/88832/09 - Check.mp3': I/O error Each time I try to do anything to the file, the checksum error count goes up on the pool. I also tried a mv and a cp over the top - but same I/O error. I performed a zpool scrub rpool followed by a zpool clear rpool - but still get the same error. Any ideas? PS - I'm running snv_86, and use the sata driver on an intel x86 architecture. B This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [caiman-discuss] disk names?
I agree with that. format(1M) and cfgadm(1M) are, ah, not the most user-friendly tools. It would be really nice to have 'zpool disks' go out and taste all the drives to see which ones are available. We already have most of the code to do it. 'zpool import' already contains the taste-all-disks-and-slices logic, and 'zpool add' already contains the logic to determine whether a device is in use. Looks like all we're really missing is a call to printf()... Is there an RFE for this? If not, I'll file one. I like the idea. Jeff On Wed, Jun 04, 2008 at 10:55:18AM -0500, Bob Friesenhahn wrote: On Tue, 3 Jun 2008, Dave Miner wrote: Putting into the zpool command would feel odd to me, but I agree that there may be a useful utility here. There is value to putting this functionality in zpool for the same reason that it was useful to put 'iostat' and other duplicate functionality in zpool. For example, zpool can skip disks which are already currently in use, or it can recommend whole disks (rather than partitions) if none of the logical disk partitions are currently in use. The zfs commands are currently at least an order of magnitude easier to comprehend and use than the legacy commands related to storage devices. It would be nice if the zfs commands will continue to simplify what is now quite obtuse. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
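To make the idea concrete, a purely hypothetical mock-up of what such a subcommand might print -- nothing like this exists today:

# zpool disks
c1t0d0   available   (whole disk)
c1t1d0   in use      (pool: tank)
c1t2d0   in use      (slice 0 mounted on /export/home)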
Re: [zfs-discuss] ZFS with raidz
Very cool! Just one comment. You said: "We'll try compression level #9." gzip-9 is *really* CPU-intensive, often for little gain over gzip-1. As in, it can take 100 times longer and yield just a few percent gain. The CPU cost will limit write bandwidth to a few MB/sec per core. I'd suggest that you begin by doing a simple experiment -- create a filesystem at each compression level, copy representative identical data to each one, and compare space usage. My guess is that you'll find the knee in the cost/benefit curve well below gzip-9. Also, if you're storing jpegs or video files, those are already compressed, in which case the benefit will be zero even at gzip-9. That said, the other consideration is how you're using the storage. If the write rate is modest and disk space is at a premium, the CPU cost may simply not matter. And note that only writes are affected: when reading data back, gzip is equally fast regardless of level. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
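A sketch of the experiment suggested above (the pool name and the data path are placeholders):

#!/bin/ksh
for level in 1 3 6 9
do
	zfs create -o compression=gzip-$level tank/gztest$level
	cp -r /path/to/sample/data /tank/gztest$level
done
sync
zfs get -r compressratio,used tank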
Re: [zfs-discuss] recovering data from a dettach mirrored vdev
Yes, I think that would be useful. Something like 'zpool revive' or 'zpool undead'. It would not be completely general-purpose -- in a pool with multiple mirror devices, it could only work if all replicas were detached in the same txg -- but for the simple case of a single top-level mirror vdev, or a clean 'zpool split', it's actually pretty straightforward. Jeff On Tue, May 06, 2008 at 11:16:25AM +0100, Darren J Moffat wrote: Great tool, any chance we can have it integrated into zpool(1M) so that it can find and fixup on import detached vdevs as new pools ? I'd think it would be reasonable to extend the meaning of 'zpool import -D' to list detached vdevs as well as destroyed pools. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] recovering data from a dettach mirrored vdev
Oh, you're right! Well, that will simplify things! All we have to do is convince a few bits of code to ignore ub_txg == 0. I'll try a couple of things and get back to you in a few hours... Jeff On Fri, May 02, 2008 at 03:31:52AM -0700, Benjamin Brumaire wrote: Hi, while diving deeply in zfs in order to recover data I found that every uberblock in label0 does have the same ub_rootbp and a zeroed ub_txg. Does it means only ub_txg was touch while detaching? Hoping it is the case, I modified ub_txg from one uberblock to match the tgx from the label and now I try to calculate the new SHA256 checksum but I failed. Can someone explain what I did wrong? And of course how to do it correctly? bbr The example is from a valid uberblock which belongs an other pool. Dumping the active uberblock in Label 0: # dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=1024 | od -x 1024+0 records in 1024+0 records out 000 b10c 00ba 0009 020 8bf2 8eef f6db c46f 4dcc 040 bba8 481a 0001 060 05e6 0003 0001 100 05e6 005b 0001 120 44e9 00b2 0001 0703 800b 140 160 8bf2 200 0018 a981 2f65 0008 220 e734 adf2 037a cedc d398 c063 240 da03 8a6e 26fc 001c 260 * 0001720 7a11 b10c da7a 0210 0001740 3836 20fb e2a7 a737 a947 feed 43c5 c045 0001760 82a8 133d 0ba7 9ce7 e5d5 64e2 2474 3b03 0002000 Checksum is at pos 01740 01760 I try to calculate it assuming only uberblock is relevant. #dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=168 | digest -a sha256 168+0 records in 168+0 records out 710306650facf818e824db5621be394f3b3fe934107bdfc861bbc82cb9e1bbf3 Helas not matching :-( This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] lost zpool when server restarted.
It's OK that you're missing labels 2 and 3 -- there are four copies precisely so that you can afford to lose a few. Labels 2 and 3 are at the end of the disk. The fact that only they are missing makes me wonder if someone resized the LUNs. Growing them would be OK, but shrinking them would indeed cause the pool to fail to open (since part of it was amputated). There ought to be more helpful diagnostics in the FMA error log. After a failed attempt to import, type this: # fmdump -ev and let me know what it says. Jeff On Tue, Apr 29, 2008 at 03:31:53PM -0400, Krzys wrote: I have a problem on one of my systems with zfs. I used to have zpool created with 3 luns on SAN. I did not have to put any raid or anything on it since it was already using raid on SAN. Anyway server rebooted and I cannot zee my pools. When I do try to import it it does fail. I am using EMC Clarion as SAN and powerpath # zpool list no pools available # zpool import -f pool: mypool id: 4148251638983938048 state: FAULTED status: One or more devices are missing from the system. action: The pool cannot be imported. Attach the missing devices and try again. see: http://www.sun.com/msg/ZFS-8000-3C config: mypool UNAVAIL insufficient replicas emcpower0a UNAVAIL cannot open emcpower2a UNAVAIL cannot open emcpower3a ONLINE I think I am able to see all the luns and I should be able to access them on my sun box. # powermt display dev=all Pseudo name=emcpower0a CLARiiON ID=APM00070202835 [NRHAPP02] Logical device ID=6006016045201A001264FB20990FDC11 [LUN 13] state=alive; policy=CLAROpt; priority=0; queued-IOs=0 Owner: default=SP B, current=SP B == Host --- - Stor - -- I/O Path - -- Stats --- ### HW Path I/O Paths Interf. Mode State Q-IOs Errors == 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c2t5006016041E035A4d0s0 SP A4 active alive 0 0 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c2t5006016941E035A4d0s0 SP B5 active alive 0 0 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c3t5006016141E035A4d0s0 SP A5 active alive 0 0 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c3t5006016841E035A4d0s0 SP B4 active alive 0 0 Pseudo name=emcpower1a CLARiiON ID=APM00070202835 [NRHAPP02] Logical device ID=6006016045201A004C1388343C10DC11 [LUN 14] state=alive; policy=CLAROpt; priority=0; queued-IOs=0 Owner: default=SP B, current=SP B == Host --- - Stor - -- I/O Path - -- Stats --- ### HW Path I/O Paths Interf. Mode State Q-IOs Errors == 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c2t5006016041E035A4d1s0 SP A4 active alive 0 0 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c2t5006016941E035A4d1s0 SP B5 active alive 0 0 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c3t5006016141E035A4d1s0 SP A5 active alive 0 0 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c3t5006016841E035A4d1s0 SP B4 active alive 0 0 Pseudo name=emcpower3a CLARiiON ID=APM00070202835 [NRHAPP02] Logical device ID=6006016045201A00A82C68514E86DC11 [LUN 7] state=alive; policy=CLAROpt; priority=0; queued-IOs=0 Owner: default=SP B, current=SP B == Host --- - Stor - -- I/O Path - -- Stats --- ### HW Path I/O Paths Interf. 
Mode State Q-IOs Errors == 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c2t5006016041E035A4d3s0 SP A4 active alive 0 0 3074 [EMAIL PROTECTED],70/[EMAIL PROTECTED]/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c2t5006016941E035A4d3s0 SP B5 active alive 0 0 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c3t5006016141E035A4d3s0 SP A5 active alive 0 0 3072 [EMAIL PROTECTED],70/[EMAIL PROTECTED],2/SUNW,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 c3t5006016841E035A4d3s0 SP B4 active alive 0 0 Pseudo name=emcpower2a CLARiiON ID=APM00070202835 [NRHAPP02] Logical device ID=600601604B141B00C2F6DB2AC349DC11 [LUN 24] state=alive; policy=CLAROpt; priority=0; queued-IOs=0 Owner: default=SP B, current=SP B
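As a side note, zdb(1M) can dump all four labels on a device directly, which makes it easy to see which ones survived on each emcpower device; for example:

# zdb -l /dev/rdsk/emcpower0a

Output showing only labels 0 and 1 would be exactly the signature a shrunken LUN leaves behind.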
Re: [zfs-discuss] lost zpool when server restarted.
Looking at the txg numbers, it's clear that labels on two devices that are unavailable now may be stale: Actually, they look OK. The txg values in the label indicate the last txg in which the pool configuration changed for devices in that top-level vdev (e.g. mirror or raid-z group), not the last txg synced. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] recovering data from a dettach mirrored vdev
OK, here you go. I've successfully recovered a pool from a detached device using the attached binary. You can verify its integrity against the following MD5 hash: # md5sum labelfix ab4f33d99fdb48d9d20ee62b49f11e20 labelfix It takes just one argument -- the disk to repair: # ./labelfix /dev/rdsk/c0d1s4 If all goes according to plan, your old pool should be importable. If you do a zpool status -v, it will complain that the old mirrors are no longer there. You can clean that up by detaching them: # zpool detach mypool guid where guid is the long integer that zpool status -v reports as the name of the missing device. Good luck, and please let us know how it goes! Jeff On Sat, May 03, 2008 at 10:48:34PM -0700, Jeff Bonwick wrote: Oh, you're right! Well, that will simplify things! All we have to do is convince a few bits of code to ignore ub_txg == 0. I'll try a couple of things and get back to you in a few hours... Jeff On Fri, May 02, 2008 at 03:31:52AM -0700, Benjamin Brumaire wrote: Hi, while diving deeply in zfs in order to recover data I found that every uberblock in label0 does have the same ub_rootbp and a zeroed ub_txg. Does it means only ub_txg was touch while detaching? Hoping it is the case, I modified ub_txg from one uberblock to match the tgx from the label and now I try to calculate the new SHA256 checksum but I failed. Can someone explain what I did wrong? And of course how to do it correctly? bbr The example is from a valid uberblock which belongs an other pool. Dumping the active uberblock in Label 0: # dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=1024 | od -x 1024+0 records in 1024+0 records out 000 b10c 00ba 0009 020 8bf2 8eef f6db c46f 4dcc 040 bba8 481a 0001 060 05e6 0003 0001 100 05e6 005b 0001 120 44e9 00b2 0001 0703 800b 140 160 8bf2 200 0018 a981 2f65 0008 220 e734 adf2 037a cedc d398 c063 240 da03 8a6e 26fc 001c 260 * 0001720 7a11 b10c da7a 0210 0001740 3836 20fb e2a7 a737 a947 feed 43c5 c045 0001760 82a8 133d 0ba7 9ce7 e5d5 64e2 2474 3b03 0002000 Checksum is at pos 01740 01760 I try to calculate it assuming only uberblock is relevant. #dd if=/dev/dsk/c0d1s4 bs=1 iseek=247808 count=168 | digest -a sha256 168+0 records in 168+0 records out 710306650facf818e824db5621be394f3b3fe934107bdfc861bbc82cb9e1bbf3 Helas not matching :-( This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss labelfix Description: Binary data ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] recovering data from a dettach mirrored vdev
Oh, and here's the source code, for the curious:

#include <devid.h>
#include <dirent.h>
#include <errno.h>
#include <libintl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/vdev_impl.h>

/*
 * Write a label block with a ZBT checksum.
 */
static void
label_write(int fd, uint64_t offset, uint64_t size, void *buf)
{
	zio_block_tail_t *zbt, zbt_orig;
	zio_cksum_t zc;

	zbt = (zio_block_tail_t *)((char *)buf + size) - 1;
	zbt_orig = *zbt;

	ZIO_SET_CHECKSUM(&zbt->zbt_cksum, offset, 0, 0, 0);
	zio_checksum(ZIO_CHECKSUM_LABEL, &zc, buf, size);

	VERIFY(pwrite64(fd, buf, size, offset) == size);

	*zbt = zbt_orig;
}

int
main(int argc, char **argv)
{
	int fd;
	vdev_label_t vl;
	nvlist_t *config;
	uberblock_t *ub = (uberblock_t *)vl.vl_uberblock;
	uint64_t txg;
	char *buf;
	size_t buflen;

	VERIFY(argc == 2);
	VERIFY((fd = open(argv[1], O_RDWR)) != -1);

	/* Read label 0 and unpack its config nvlist. */
	VERIFY(pread64(fd, &vl, sizeof (vdev_label_t), 0) ==
	    sizeof (vdev_label_t));
	VERIFY(nvlist_unpack(vl.vl_vdev_phys.vp_nvlist,
	    sizeof (vl.vl_vdev_phys.vp_nvlist), &config, 0) == 0);

	/*
	 * A detached label has txg 0; recover the txg from the
	 * birth time of the root block pointer.
	 */
	VERIFY(nvlist_lookup_uint64(config, ZPOOL_CONFIG_POOL_TXG, &txg) == 0);
	VERIFY(txg == 0);
	VERIFY(ub->ub_txg == 0);
	VERIFY(ub->ub_rootbp.blk_birth != 0);

	txg = ub->ub_rootbp.blk_birth;
	ub->ub_txg = txg;

	VERIFY(nvlist_remove_all(config, ZPOOL_CONFIG_POOL_TXG) == 0);
	VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_TXG, txg) == 0);

	buf = vl.vl_vdev_phys.vp_nvlist;
	buflen = sizeof (vl.vl_vdev_phys.vp_nvlist);
	VERIFY(nvlist_pack(config, &buf, &buflen, NV_ENCODE_XDR, 0) == 0);

	/* Rewrite the uberblock and the nvlist with the recovered txg. */
	label_write(fd, offsetof(vdev_label_t, vl_uberblock),
	    1ULL << UBERBLOCK_SHIFT, ub);
	label_write(fd, offsetof(vdev_label_t, vl_vdev_phys),
	    VDEV_PHYS_SIZE, &vl.vl_vdev_phys);

	fsync(fd);

	return (0);
}

Jeff
Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Indeed, things should be simpler with fewer (generally one) pool. That said, I suspect I know the reason for the particular problem you're seeing: we currently do a bit too much vdev-level caching. Each vdev can have up to 10MB of cache. With 132 pools, even if each pool is just a single iSCSI device, that's 1.32GB of cache. We need to fix this, obviously. In the interim, you might try setting zfs_vdev_cache_size to some smaller value, like 1MB. Still, I'm curious -- why lots of pools? Administration would be simpler with a single pool containing many filesystems. Jeff On Wed, Apr 30, 2008 at 11:48:07AM -0700, Bill Moore wrote: A silly question: Why are you using 132 ZFS pools as opposed to a single ZFS pool with 132 ZFS filesystems? --Bill On Wed, Apr 30, 2008 at 01:53:32PM -0400, Chris Siebenmann wrote: I have a test system with 132 (small) ZFS pools[*], as part of our work to validate a new ZFS-based fileserver environment. In testing, it appears that we can produce situations that will run the kernel out of memory, or at least out of some resource such that things start complaining 'bash: fork: Resource temporarily unavailable'. Sometimes the system locks up solid. I've found at least two situations that reliably do this: - trying to 'zpool scrub' each pool in sequence (waiting for each scrub to complete before starting the next one). - starting simultaneous sequential read IO from all pools from a NFS client. (trying to do the same IO from the server basically kills the server entirely.) If I aggregate the same disk space into 12 pools instead of 132, the same IO load does not kill the system. The ZFS machine is an X2100 M2 with 2GB of physical memory and 1GB of swap, running 64-bit Solaris 10 U4 with an almost current set of patches; it gets the storage from another machine via ISCSI. The pools are non-redundant, with each vdev being a whole ISCSI LUN. Is this a known issue (or issues)? If this isn't a known issue, does anyone have pointers to good tools to trace down what might be happening and where memory is disappearing and so on? Does the system plain need more memory for this number of pools and if so, does anyone know how much? Thanks in advance. (I was pointed to mdb -k's '::kmastat' by some people on the OpenSolaris IRC channel but I haven't spotted anything particularly enlightening in its output, and I can't run it once the system has gone over the edge.) - cks [*: we have an outstanding uncertainty over how many ZFS pools a single system can sensibly support, so testing something larger than we'd use in production seemed sensible.] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
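For example, to cap it at 1MB per vdev, add this line to /etc/system and reboot (a sketch; this tunable is not a committed interface):

set zfs:zfs_vdev_cache_size = 0x100000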
Re: [zfs-discuss] recovering data from a dettach mirrored vdev
If your entire pool consisted of a single mirror of two disks, A and B, and you detached B at some point in the past, you *should* be able to recover the pool as it existed when you detached B. However, I just tried that experiment on a test pool and it didn't work. I will investigate further and get back to you. I suspect it's perfectly doable, just currently disallowed due to some sort of error check that's a little more conservative than necessary. Keep that disk! Jeff On Mon, Apr 28, 2008 at 10:33:32PM -0700, Benjamin Brumaire wrote: Hi, my system (solaris b77) was physically destroyed and i loosed data saved in a zpool mirror. The only thing left is a dettached vdev from the pool. I'm aware that uberblock is gone and that i can't import the pool. But i still hope their is a way or a tool (like tct http://www.porcupine.org/forensics/) i can go too recover at least partially some data) thanks in advance for any hints. bbr This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] recovering data from a dettach mirrored vdev
Urgh. This is going to be harder than I thought -- not impossible, just hard. When we detach a disk from a mirror, we write a new label to indicate that the disk is no longer in use. As a side effect, this zeroes out all the old uberblocks. That's the bad news -- you have no uberblocks. The good news is that the uberblock only contains one field that's hard to reconstruct: ub_rootbp, which points to the root of the block tree. The root block *itself* is still there -- we just have to find it. The root block has a known format: it's a compressed objset_phys_t, almost certainly one sector in size (could be two, but very unlikely because the root objset_phys_t is highly compressible). It should be possible to write a program that scans the disk, reading each sector and attempting to decompress it. If it decompresses into exactly 1K (size of an uncompressed objset_phys_t), then we can look at all the fields to see if they look plausible. Among all candidates we find, the one whose embedded meta-dnode has the highest birth time in its dn_blkptr is the one we want. I need to get some sleep now, but I'll code this up in a couple of days and we can take it from there. If this is time-sensitive, let me know and I'll see if I can find someone else to drive it. [ I've got a bunch of commitments tomorrow, plus I'm supposed to be on vacation... typical... ;-) ] Jeff On Tue, Apr 29, 2008 at 12:15:21AM -0700, Benjamin Brumaire wrote: Jeff thank you very much for taking time to look at this. My entire pool consisted of a single mirror of two slices on different disks A and B. I attach a third slice on disk C and wait for resilver and then detach it. Now disks A and B burned and I have only disk C at hand. bbr This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of one single 'cp'
No, that is definitely not expected. One thing that can hose you is having a single disk that performs really badly. I've seen disks as slow as 5 MB/sec due to vibration, bad sectors, etc. To see if you have such a disk, try my diskqual.sh script (below). On my desktop system, which has 8 drives, I get:

# ./diskqual.sh
c1t0d0 65 MB/sec
c1t1d0 63 MB/sec
c2t0d0 59 MB/sec
c2t1d0 63 MB/sec
c3t0d0 60 MB/sec
c3t1d0 57 MB/sec
c4t0d0 61 MB/sec
c4t1d0 61 MB/sec

The diskqual test is non-destructive (it only does reads), but to get valid numbers you should run it on an otherwise idle system.

--

#!/bin/ksh

# List all disks the system knows about.
disks=`format </dev/null | grep c.t.d | nawk '{print $2}'`

# Time a 64MB sequential read from the raw device and report MB/sec.
getspeed1()
{
	ptime dd if=/dev/rdsk/${1}s0 of=/dev/null bs=64k count=1024 2>&1 |
	    nawk '$1 == "real" { printf("%.0f\n", 67.108864 / $2) }'
}

# Take the median of three runs.
getspeed()
{
	for iter in 1 2 3
	do
		getspeed1 $1
	done | sort -n | tail -2 | head -1
}

for disk in $disks
do
	echo $disk `getspeed $disk` MB/sec
done

--

Jeff

On Tue, Apr 08, 2008 at 06:44:13AM -0700, Henrik Hjort wrote: Hi! I just want to check with the community to see if this is normal. I have used a X4500 with 500Gb disks and I'm not impressed by the copy performance. I can run several jobs in parallel and get close to 400mb/s but I need better performance from a single copy. I have tried to be EVIL as well but without success. Tests done with: Solaris 10 U4 Solaris 10 U5 (B10) Nevada B86 *Setup* # zpool status pool: datapool state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM datapool ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t0d0 ONLINE 0 0 0 c1t0d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c4t0d0 ONLINE 0 0 0 c6t0d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t0d0 ONLINE 0 0 0 c0t1d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t1d0 ONLINE 0 0 0 c4t1d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t1d0 ONLINE 0 0 0 c6t1d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c0t2d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t2d0 ONLINE 0 0 0 c4t2d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t2d0 ONLINE 0 0 0 c6t2d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 c0t3d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t3d0 ONLINE 0 0 0 c4t3d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c5t3d0 ONLINE 0 0 0 c6t3d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 c0t4d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c1t4d0 ONLINE 0 0 0 c4t4d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c6t4d0 ONLINE 0 0 0 c7t4d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t5d0 ONLINE 0 0 0 c1t5d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c4t5d0 ONLINE 0 0 0 c5t5d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c6t5d0 ONLINE 0 0 0 c7t5d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t6d0 ONLINE 0 0 0 c1t6d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c4t6d0 ONLINE 0 0 0 c5t6d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c6t6d0 ONLINE 0 0 0 c7t6d0 ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t7d0 ONLINE 0 0 0 c1t7d0 ONLINE 0 0 0 *Result* - Around 50-60mb/s read parsing profile for config: copyfiles Running /tmp/temp165-231.*.*.COM-zfs-readtest-Apr_8_2008-09h_09m_07s/copyfiles/thisrun.f FileBench Version 1.2.2 5109: 0.005: CopyFiles Version 2.3 personality successfully loaded 5109: 0.005: Creating/pre-allocating files and
Re: [zfs-discuss] zfs filesystem metadata checksum
Not at present, but it's a good RFE. Unfortunately it won't be quite as simple as just adding an ioctl to report the dnode checksum. To see why, consider a file with one level of indirection: that is, it consists of a dnode, a single indirect block, and several data blocks. The indirect block contains the checksums of all the data blocks -- handy. The dnode contains the checksum of the indirect block -- but that's not so handy, because the indirect block contains more than just checksums; it also contains pointers to blocks, which are specific to the physical layout of the data on your machine. If you did remote replication using zfs send | ssh elsewhere zfs recv, the dnode checksum on 'elsewhere' would not be the same. Jeff On Tue, Apr 08, 2008 at 01:45:16PM -0700, asa wrote: Hello all. I am looking to be able to verify my zfs backups in the most minimal way, ie without having to md5 the whole volume. Is there a way to get a checksum for a snapshot and compare it to another zfs volume, containing all the same blocks and verify they contain the same information? Even when I destroy the snapshot on the source? kind of like: zfs create tank/myfs dd if=/dev/urandom bs=128k count=1000 of=/tank/myfs/TESTFILE zfs snapshot tank/[EMAIL PROTECTED] zfs send tank/[EMAIL PROTECTED] | zfs recv tank/myfs_BACKUP zfs destroy tank/[EMAIL PROTECTED] zfs snapshot tank/[EMAIL PROTECTED] someCheckSumVodooFunc(tank/myfs) someCheckSumVodooFunc(tank/myfs_BACKUP) is there some zdb hackery which results in a metadata checksum usable in this scenario? Thank you all! Asa zfs worshiper Berkeley, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Per filesystem scrub
Aye, or better yet -- give the scrub/resilver/snap reset issue fix very high priority. As it stands snapshots are impossible when you need to resilver and scrub (even on supposedly sun supported thumper configs). No argument. One of our top engineers is working on this as we speak. I say we all buy him a drink when he integrates the fix. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Per filesystem scrub
Peter, That's a great suggestion. And as fortune would have it, we have the code to do it already. Scrubbing in ZFS is driven from the logical layer, not the physical layer. When you scrub a pool, you're really just scrubbing the pool-wide metadata, then scrubbing each filesystem. At 50,000 feet, it's as simple as adding a zfs(1M) scrub subcommand and having it invoke the already-existing DMU traverse interface. Closer to ground, there are a few details to work out -- we need an option to specify whether to include snapshots, whether to descend recursively (in the case of nested filesystems), and how to handle branch points (which are created by clones). Plus we need some way to name the MOS (meta-object set, which is where we keep all pool metadata) so you can ask to scrub only that. Sounds like a nice tidy project for a summer intern! Jeff On Sat, Mar 29, 2008 at 05:14:20PM +, Peter Tribble wrote: A brief search didn't show anything relevant, so here goes: Would it be feasible to support a scrub per-filesystem rather than per-pool? The reason is that on a large system, a scrub of a pool can take excessively long (and, indeed, may never complete). Running a scrub on each filesystem allows it to be broken up into smaller chunks, which would be much easier to arrange. (For example, I could scrub one filesystem a night and not have it run into working hours.) Another reason might be that I have both busy and quiet filesystems. For the busy ones, they're regularly backed up, and the data regularly read anyway; for the quiet ones they're neither read nor backed up, so it would be nice to be able to validate those. -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
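Purely as a strawman, the interface might look something like this -- none of it exists yet:

# zfs scrub [-r] [-s] tank/home

where -r descends into child filesystems and -s includes snapshots.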
Re: [zfs-discuss] ZFS performance lower than expected
The disks in the SAN servers were indeed striped together with Linux LVM and exported as a single volume to ZFS. That is really going to hurt. In general, you're much better off giving ZFS access to all the individual LUNs. The intermediate LVM layer kills the concurrency that's native to ZFS. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
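In other words, rather than one LVM aggregate, export each LUN and hand them all to ZFS (device names illustrative):

# zpool create tank c2t0d0 c2t1d0 c2t2d0 c2t3d0

ZFS stripes across the vdevs itself and keeps independent I/O queues per device.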
Re: [zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Nathan: yes. Flipping each bit and recomputing the checksum is not only possible, we actually did it in early versions of the code. The problem is that it's really expensive. For a 128K block, that's a million bits, so you have to re-run the checksum a million times, on 128K of data. That's 128GB of data to churn through.

So Bob: you're right too. It's generally much cheaper to retry the I/O, try another disk, try a ditto block, etc. That said, when all else fails, a 128GB computation is a lot cheaper than a restore from tape.

At some point it becomes a bit philosophical. Suppose the block in question is a single user data block. How much of the machine should you be willing to dedicate to getting that block back? I mean, suppose you knew that it was theoretically possible, but would consume 500 hours of CPU time during which everything else would be slower -- and the affected app's read() system call would hang for 500 hours. What is the right policy? There's no one right answer. If we were to introduce a feature like this, we'd need some admin-settable limit on how much time to dedicate to it.

For some checksum functions like fletcher2 and fletcher4, it is possible to do much better than brute force because you can compute an incremental update -- that is, you can compute the effect of changing the nth bit without rerunning the entire checksum. This is, however, not possible with SHA-256 or any other secure hash.

We ended up taking that code out because single-bit errors didn't seem to arise in practice, and in testing, the error correction had a rather surprising unintended side effect: it masked bugs in the code!

The nastiest kind of bug in ZFS is something we call a future leak, which is when some change from txg (transaction group) 37 ends up going out as part of txg 36. It normally wouldn't matter, except if you lost power before txg 37 was committed to disk. On reboot you'd have inconsistent on-disk state (all of 36 plus random bits of 37). We developed coding practices and stress tests to catch future leaks, and as far as I know we've never actually shipped one. But they are scary.

If you *do* have a future leak, it's not uncommon for it to be a very small change -- perhaps incrementing a counter in some on-disk structure. The thing is, if the counter is going from even to odd, that's exactly a one-bit change. The single-bit error correction logic would happily detect these and fix them up -- not at all what you want when testing! (Of course, we could turn it off during testing -- but then we wouldn't be testing it.)

All that said, I'm still occasionally tempted to bring it back. It may become more relevant with flash memory as a storage medium.

Jeff

On Sun, Mar 02, 2008 at 05:28:48PM -0600, Bob Friesenhahn wrote: On Mon, 3 Mar 2008, Nathan Kroenert wrote: Speaking of expensive, but interesting things we could do - From the little I know of ZFS's checksum, it's NOT like the ECC checksum we use in memory in that it's not something we can use to determine which bit flipped in the event that there was a single bit flip in the data. (I could be completely wrong here... but...) It seems that the emphasis on single-bit errors may be misplaced. Is there evidence which suggests that single-bit errors are much more common than multiple bit errors?
What is the chance we could put a little more resilience into ZFS such that if we do get a checksum error, we systematically flip each bit in sequence and check the checksum to see if we could in fact proceed (including writing the data back correctly.). It is easier to retry the disk read another 100 times or store the data in multiple places. Or build into the checksum something analogous to ECC so we can choose to use NON-ZFS protected disks and paths, but still have single bit flip protection... Disk drives commonly use an algorithm like Reed Solomon (http://en.wikipedia.org/wiki/Reed-Solomon_error_correction) which provides forward-error correction. This is done in hardware. Doing the same in software is likely to be very slow. What do others on the list think? Do we have enough folks using ZFS on HDS / EMC / other hardware RAID(X) environments that might find this useful? It seems that since ZFS is intended to support extremely large storage pools, available energy should be spent ensuring that the storage pool remains healthy or can be repaired. Loss of individual file blocks is annoying, but loss of entire storage pools is devastating. Since raw disk is cheap (and backups are expensive), it makes sense to write more redundant data rather than to minimize loss through exotic algorithms. Even if RAID is not used, redundant copies may be used on the same disk to help protect against block read errors. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,
Re: [zfs-discuss] Cause for data corruption?
I thought RAIDZ would correct data errors automatically with the parity data. Right. However, if the data is corrupted while in memory (e.g. on a PC with non-parity memory), there's nothing ZFS can do to detect that. I mean, not even theoretically. The best we could do would be to narrow the windows of vulnerability by recomputing the checksum every time we accessed an in-memory object, which would be terribly expensive. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] moving zfs filesystems between disks
Yes. Just say this: # zpool replace mypool disk1 disk2 This will do all the intermediate steps you'd expect: attach disk2 as a mirror of disk1, resilver, detach disk2, and grow the pool to reflect the larger size of disk1. Jeff On Wed, Feb 27, 2008 at 04:48:59PM -0800, Bill Shannon wrote: I've just started using zfs. I copied data from a ufs filesystem on disk 1 to a zfs pool/filesystem on disk 2. Can I add disk 1 as a mirror for disk 2, and then remove disk 2 from the mirror, and end up with all the data back on disk 1 in zfs (after some amount of time, of course)? If disk 1 is larger than disk 2, will the larger amount of space be available after I remove the disk 2 mirror? (Disk 2 is a full disk, but disk 1 is actually just a partition of a disk. I assume that doesn't make any difference.) Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] moving zfs filesystems between disks
Oops -- I transposed 1 and 2 in the last sentence. Corrected version, and hopefully a bit easier to read: # zpool replace mypool olddisk newdisk This will do all the intermediate steps you'd expect: attach newdisk as a mirror of olddisk, resilver, detach olddisk, and grow the pool to reflect the larger size of newdisk. Jeff On Wed, Feb 27, 2008 at 05:04:02PM -0800, Jeff Bonwick wrote: Yes. Just say this: # zpool replace mypool disk1 disk2 This will do all the intermediate steps you'd expect: attach disk2 as a mirror of disk1, resilver, detach disk2, and grow the pool to reflect the larger size of disk1. Jeff On Wed, Feb 27, 2008 at 04:48:59PM -0800, Bill Shannon wrote: I've just started using zfs. I copied data from a ufs filesystem on disk 1 to a zfs pool/filesystem on disk 2. Can I add disk 1 as a mirror for disk 2, and then remove disk 2 from the mirror, and end up with all the data back on disk 1 in zfs (after some amount of time, of course)? If disk 1 is larger than disk 2, will the larger amount of space be available after I remove the disk 2 mirror? (Disk 2 is a full disk, but disk 1 is actually just a partition of a disk. I assume that doesn't make any difference.) Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz2 resilience on 3 disks
1) If i create a raidz2 pool on some disks, start to use it, then the disks' controllers change. What will happen to my zpool? Will it be lost or is there some disk tagging which allows zfs to recognise the disks? It'll be fine. ZFS opens by path, but then checks both the devid and the on-disk vdev label, which is dispositive when the others disagree. 2) if i create a raidz2 on 3 HDs, do i have any resilience? If any one of those drives fails, do i loose everything? I've got one such pool and i'm afraid it's a ticking time bomb. You're fine. RAID-Z2 is N+2, and you have N=1. A three-way mirror would give you better performance (because there's no parity to generate), but from a resilience standpoint they're equivalent. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
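For comparison, the two layouts with equivalent resilience (hypothetical devices):

# zpool create tank raidz2 c0d0 c1d0 c2d0
# zpool create tank mirror c0d0 c1d0 c2d0

The first is N+2 with N=1; the second is a three-way mirror.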
Re: [zfs-discuss] Lost intermediate snapshot; incremental backup still possible?
I think so. On your backup pool, roll back to the last snapshot that was successfully received. Then you should be able to send an incremental between that one and the present. Jeff On Thu, Feb 07, 2008 at 08:38:38AM -0800, Ian wrote: I keep my system synchronized to a USB disk from time to time. The script works by sending incremental snapshots to a pool on the USB disk, then deleting those snapshots from the source machine. A botched script ended up deleting a snapshot that was not successfully received on the USB disk. Now, I've lost the ability to send incrementally since the intermediate snapshot is lost. From what I gather, if I try to send a full snapshot, it will require deleting and replacing the dataset on the USB disk. Is there any way around this? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
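A sketch with hypothetical names, where 'backup' is the pool on the USB disk and @snapN is the newest snapshot both sides still have:

# zfs rollback -r backup/data@snapN
# zfs snapshot tank/data@now
# zfs send -i @snapN tank/data@now | zfs recv backup/data

The -r on rollback discards any snapshots on the backup side that are newer than @snapN.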
Re: [zfs-discuss] Issue fixing ZFS corruption
The Silicon Image 3114 controller is known to corrupt data. Google for silicon image 3114 corruption to get a flavor. I'd suggest getting your data onto different h/w, quickly. Jeff On Wed, Jan 23, 2008 at 12:34:56PM -0800, Bertrand Sirodot wrote: Hi, I have been experiencing corruption on one of my ZFS pool over the last couple of days. I have tried running zpool scrub on the pool, but everytime it comes back with new files being corrupted. I would have thought that zpool scrub would have identified the corrupted files once and for all and would be fine afterwards. The feeling I have right now is that zpool scrub is actually spreading the corruption and won't stop until I have no more files on the file systems. I am running 5.11 snv_60 on an Asus M2A VM motherboard. I am using both the SATA controller on the motherboard and a Si3114 based controller. I have had the Si3114 controller for a couple of years now with no issue, that I know of. Any idea? I was trying to salvage the situation, but it looks like I am going to have to destroy the pool and recreate it. Thanks a lot in advance, Bertrand. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Issue fixing ZFS corruption
Actually s10_72, but it's not really a fix, it's a workaround for a bug in the hardware. I don't know how effective it is. Jeff On Wed, Jan 23, 2008 at 04:54:54PM -0800, Erast Benson wrote: I believe issue been fixed in snv_72+, no? On Wed, 2008-01-23 at 16:41 -0800, Jeff Bonwick wrote: The Silicon Image 3114 controller is known to corrupt data. Google for silicon image 3114 corruption to get a flavor. I'd suggest getting your data onto different h/w, quickly. Jeff On Wed, Jan 23, 2008 at 12:34:56PM -0800, Bertrand Sirodot wrote: Hi, I have been experiencing corruption on one of my ZFS pool over the last couple of days. I have tried running zpool scrub on the pool, but everytime it comes back with new files being corrupted. I would have thought that zpool scrub would have identified the corrupted files once and for all and would be fine afterwards. The feeling I have right now is that zpool scrub is actually spreading the corruption and won't stop until I have no more files on the file systems. I am running 5.11 snv_60 on an Asus M2A VM motherboard. I am using both the SATA controller on the motherboard and a Si3114 based controller. I have had the Si3114 controller for a couple of years now with no issue, that I know of. Any idea? I was trying to salvage the situation, but it looks like I am going to have to destroy the pool and recreate it. Thanks a lot in advance, Bertrand. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] x4500 recommendations for netbackup dsu?
Yep, compression is generally a nice win for backups. The amount of compression will depend on the nature of the data. If it's all mpegs, you won't see any advantage because they're already compressed. But for just about everything else, 2-3x is typical. As for hot spares, they are indeed global. Jeff On Tue, Dec 11, 2007 at 03:16:44PM -0800, Dave Lowenstein wrote: Okay, my order for an x4500 went through so sometime soon I'll be using it as a big honkin area for DSUs and DSSUs for netbackup. Does anybody have any experience with using zfs compression for this purpose? The thought of doubling 48tb to 96 tb is enticing. Are there any other zfs tweaks that might aid in performance for what will pretty much be a lot of long and large reads and writes? I'm planning on one big chunk of space for a permanently on disk DSU, and another for the DSSU staging areas. Also, I haven't looked into this but is a spare considered part of a zpool, or is there such a thing as a global spare? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
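For example (dataset name assumed):

# zfs create -o compression=on tank/dsu
# zfs get compressratio tank/dsu

compressratio reports the realized savings once backup data starts landing.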
Re: [zfs-discuss] ZFS Roadmap - thoughts on expanding raidz / restriping / defrag
In short, yes. The enabling technology for all of this is something we call bp rewrite -- that is, the ability to rewrite an existing block pointer (bp) to a new location. Since ZFS is COW, this would be trivial in the absence of snapshots -- just touch all the data. But because a block may appear in many snapshots, there's more to it. It's not impossible, just a bit tricky... and we're working on it. Once we have bp rewrite, many cool features will become available as trivial applications of it: on-line defrag, restripe, recompress, etc. Jeff On Mon, Dec 17, 2007 at 02:29:14AM -0800, Ross wrote: Hey folks, Does anybody know if any of these are on the roadmap for ZFS, or have any idea how long it's likely to be before we see them (we're in no rush - late 2008 would be fine with us, but it would be nice to know they're being worked on)? I've seen many people ask for the ability to expand a raid-z pool by adding devices. I'm wondering if it would be useful to work on a defrag / restriping tool to work hand in hand with this. I'm assuming that when the functionality is available, adding a disk to a raid-z set will mean the existing data stays put, and new data is written across a wider stripe. That's great for performance for new data, but not so good for the existing files. Another problem is that you can't guarantee how much space will be added. That will have to be calculated based on how much data you already have. ie: If you have a simple raid-z of five 500GB drives, you would expect adding another drive to add 500GB of space. However, if your pool is half full, you can only make use of 250GB of space, the other 250GB is going to be wasted. What I would propose to solve this is to implement a defrag / restripe utility as part of the raid-z upgrade process, making it a three step process: - New drive added to raid-z pool - Defrag tool begins restriping and defragmenting old data - Once restripe complete, pool reports the additional free space There are some limitations to this. You would maybe want to advise that expanding a raid-z pool should only be done with a reasonable amount of free disk space, and that it may take some time. It may also be beneficial to add the ability to add multiple disks in one go. However, if it works it would seem to add several benefits: - Raid-z pools can be expanded - ZFS gains a defrag tool - ZFS gains a restriping tool This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best option for my home file server?
I would keep it simple. Let's call your 250GB disks A, B, C, D, and your 500GB disks X and Y. I'd either make them all mirrors: zpool create mypool mirror A B mirror C D mirror X Y or raidz the little ones and mirror the big ones: zpool create mypool raidz A B C D mirror X Y or, as you mention, get another 500GB disk, Z, and raidz like this: zpool create mypool raidz A B C D raidz X Y Z Jeff On Wed, Sep 26, 2007 at 01:06:38PM -0700, Christopher wrote: I'm about to build a fileserver and I think I'm gonna use OpenSolaris and ZFS. I've got a 40GB PATA disk which will be the OS disk, and then I've got 4x250GB SATA + 2x500GB SATA disks. From what you are writing I would think my best option would be to slice the 500GB disks in two 250GB and then make two RAIDz with two 250 disks and one partition from each 500 disk, giving me two RAIDz of 4 slices of 250, equaling 2 x 750GB RAIDz. How would the performance be with this? I mean, it would probably drop since I would have two raidz slices on one disk. From what I gather, I would still be able to lose one of the 500 disks (or 250) and still be able to recover, right? Perhaps I should just get another 500GB disk and run a RAIDz on the 500s and one RAIDz on the 250s? I'm also a bit of a noob when it comes to ZFS (but it looks like it's not that hard to admin) - Would I be able to join the two RAIDz together for one BIG volume altogether? And it will survive one disk failure? /Christopher This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
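One hedged footnote on growing this later, reusing the hypothetical disk names above: you can start with just the small disks and add the big pair as a second vdev when you buy them (zpool add may insist on -f when the new vdev's replication level differs from the pool's):

# zpool create mypool raidz A B C D
# zpool add mypool mirror X Y   # pool grows in place; existing data stays put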
Re: [zfs-discuss] ZFS panic when trying to import pool
Basically, it is complaining that there aren't enough disks to read the pool metadata. This would suggest that in your 3-disk RAID-Z config, either two disks are missing, or one disk is missing *and* another disk is damaged -- due to prior failed writes, perhaps. (I know there's at least one disk missing because the failure mode is errno 6, which is ENXIO.) Can you tell from /var/adm/messages or fmdump whether there were write errors to multiple disks, or to just one? Jeff On Tue, Sep 18, 2007 at 05:26:16PM -0700, Geoffroy Doucet wrote: I have a raid-z zfs filesystem with 3 disks. The disks were starting to have read and write errors. The disks were so bad that I started to have trans_err. The server locked up and was reset. Now, when trying to import the pool, the system panics. I installed the latest Recommended patches on my Solaris U3 and also installed the latest kernel patch (120011-14). But still, when trying to do zpool import pool, it panics. I also dd'd the disks and tested on another server with OpenSolaris B72, and still the same thing. Here is the panic backtrace:

Stack Backtrace
vpanic()
assfail3+0xb9(f7dde5f0, 6, f7dde840, 0, f7dde820, 153)
space_map_load+0x2ef(ff008f1290b8, c00fc5b0, 1, ff008f128d88, ff008dd58ab0)
metaslab_activate+0x66(ff008f128d80, 8000)
metaslab_group_alloc+0x24e(ff008f46bcc0, 400, 3fd0f1, 32dc18000, ff008fbeaa80, 0)
metaslab_alloc_dva+0x192(ff008f2d1a80, ff008f235730, 200, ff008fbeaa80, 0, 0)
metaslab_alloc+0x82(ff008f2d1a80, ff008f235730, 200, ff008fbeaa80, 2, 3fd0f1)
zio_dva_allocate+0x68(ff008f722790)
zio_next_stage+0xb3(ff008f722790)
zio_checksum_generate+0x6e(ff008f722790)
zio_next_stage+0xb3(ff008f722790)
zio_write_compress+0x239(ff008f722790)
zio_next_stage+0xb3(ff008f722790)
zio_wait_for_children+0x5d(ff008f722790, 1, ff008f7229e0)
zio_wait_children_ready+0x20(ff008f722790)
zio_next_stage_async+0xbb(ff008f722790)
zio_nowait+0x11(ff008f722790)
dmu_objset_sync+0x196(ff008e4e5000, ff008f722a10, ff008f260a80)
dsl_dataset_sync+0x5d(ff008df47e00, ff008f722a10, ff008f260a80)
dsl_pool_sync+0xb5(ff00882fb800, 3fd0f1)
spa_sync+0x1c5(ff008f2d1a80, 3fd0f1)
txg_sync_thread+0x19a(ff00882fb800)
thread_start+8()

And here is the panic message buf:

panic[cpu0]/thread=ff0001ba2c80: assertion failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 (0x6 == 0x0), file: ../../common/fs/zfs/space_map.c, line: 339

ff0001ba24f0 genunix:assfail3+b9 ()
ff0001ba2590 zfs:space_map_load+2ef ()
ff0001ba25d0 zfs:metaslab_activate+66 ()
ff0001ba2690 zfs:metaslab_group_alloc+24e ()
ff0001ba2760 zfs:metaslab_alloc_dva+192 ()
ff0001ba2800 zfs:metaslab_alloc+82 ()
ff0001ba2850 zfs:zio_dva_allocate+68 ()
ff0001ba2870 zfs:zio_next_stage+b3 ()
ff0001ba28a0 zfs:zio_checksum_generate+6e ()
ff0001ba28c0 zfs:zio_next_stage+b3 ()
ff0001ba2930 zfs:zio_write_compress+239 ()
ff0001ba2950 zfs:zio_next_stage+b3 ()
ff0001ba29a0 zfs:zio_wait_for_children+5d ()
ff0001ba29c0 zfs:zio_wait_children_ready+20 ()
ff0001ba29e0 zfs:zio_next_stage_async+bb ()
ff0001ba2a00 zfs:zio_nowait+11 ()
ff0001ba2a80 zfs:dmu_objset_sync+196 ()
ff0001ba2ad0 zfs:dsl_dataset_sync+5d ()
ff0001ba2b40 zfs:dsl_pool_sync+b5 ()
ff0001ba2bd0 zfs:spa_sync+1c5 ()
ff0001ba2c60 zfs:txg_sync_thread+19a ()
ff0001ba2c70 unix:thread_start+8 ()
syncing file systems...

Is there a way to restore the data? Is there a way to fsck the zpool, and correct the error manually?
This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
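For the diagnosis Jeff asks about, a hedged sketch of where to look (fmdump is the standard Solaris FMA tool; the exact ereport classes vary by release):

# fmdump -e    # one line per error report: timestamp and ereport class
# fmdump -eV   # verbose form, showing which device each ereport refers to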
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
As you can see, two independent ZFS blocks share one parity block. COW won't help you here; you would need to be sure that each ZFS transaction goes to a different (and free) RAID5 row. This is, I believe, the main reason why poor RAID5 wasn't used in the first place. Exactly right. RAID-Z has different performance trade-offs than RAID-5, but the deciding factor was correctness. I'm really glad you're doing these experiments! It's good to know what the trade-offs are, performance-wise, between RAID-Z and classic RAID-5. At a minimum, it tells us what's on the table, and what we're paying for transactional semantics. To be honest, I'm pleased that it's only 2x. It wouldn't have surprised me if it were Nx for an N+1 configuration. A factor of 2 is something we can work with. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
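To make the shared-parity hazard concrete, a worked toy example (numbers invented): in a 4+1 RAID-5, one row might hold blocks from two unrelated transactions, D1 D2 D3 D4 with parity P = D1 xor D2 xor D3 xor D4. Updating D1 means rewriting both D1 and P; lose power between those two writes and the row's parity no longer matches its data, so a later reconstruction of the untouched D3 from parity quietly returns garbage. RAID-Z sidesteps this by giving every logical block its own full-width stripe and its own parity, so no two transactions ever share a parity block.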
Re: [zfs-discuss] Mysterious corruption with raidz2 vdev
I suspect this is a bug in raidz error reporting. With a mirror, each copy either checksums correctly or it doesn't, so we know which drives gave us bad data. With RAID-Z, we have to infer which drives have damage. If the number of drives returning bad data is less than or equal to the number of parity drives, we can both detect and correct the error. But if, say, three drives in a RAID-Z2 stripe return corrupt data, we have no way to know which drives are at fault -- there's just not enough information, and I mean that in the mathematical sense (fewer equations than unknowns). That said, we should enhance 'zpool status' to indicate the number of detected-but-undiagnosable errors on each RAID-Z vdev. Jeff Kevin wrote: We'll try running all of the diagnostic tests to rule out any other issues. But my question is, wouldn't I need to see at least 3 checksum errors on the individual devices in order for there to be a visible error in the top level vdev? There doesn't appear to be enough raw checksum errors on the disks for there to have been 3 errors in the same vdev block. Or am I not understanding the checksum count correctly? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS raid is very slow???
A couple of questions for you: (1) What OS are you running (Solaris, BSD, MacOS X, etc)? (2) What's your config? In particular, are any of the partitions on the same disk? (3) Are you copying a few big files or lots of small ones? (4) Have you measured UFS-to-UFS and ZFS-to-ZFS performance on the same platform? That'd be useful data... Jeff On Fri, Jul 06, 2007 at 03:49:43PM -0400, Will Murnane wrote: On 7/6/07, Orvar Korvar [EMAIL PROTECTED] wrote: I have set up a ZFS raidz with 4 Samsung 500GB hard drives. It is extremely slow when I mount an NTFS partition and copy everything to zfs. It's like 100KB/sec or less. Why is that? How are you mounting said NTFS partition? When I copy from the ZFS pool to UFS, I get like 40MB/sec - isn't it very low considering I have 4 new 500GB disks in raid? And when I copy from UFS to the ZFS pool I get like 20MB/sec. Strange? Or normal results? Should I expect better performance? As of now, I am disappointed in ZFS. How fast is copying a file from ZFS to /dev/null? That would eliminate the UFS disk from the mix. Will ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
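To put numbers on Will's suggestion, a hedged sketch (file paths hypothetical):

# dd if=/tank/bigfile of=/dev/null bs=1024k   # ZFS read path alone, no UFS in the mix
# dd if=/ufs/bigfile of=/dev/null bs=1024k    # the same test against the UFS disk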
Re: [zfs-discuss] Re: zfs reports small st_size for directories?
What was the reason to make ZFS use directory sizes as the number of entries rather than the way other Unix filesystems use it? In UFS, the st_size is the size of the directory inode as though it were a file. The only reason it's like that is that UFS is sloppy and lets you cat directories -- a fine way to screw up your terminal settings, but otherwise not terribly useful. For reads (rather than readdirs) of a directory to work, st_size has to be this way. With ZFS, we decided to enforce file vs. directory semantics -- no read(2) of directories, no directory hard links (even as root), etc. What, then, should we return for st_size? We figured the number of entries would be the most useful piece of information for a sysadmin. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
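A quick hedged demo of the semantics (on the builds I've tried, an empty ZFS directory reports a size of 2, accounting for '.' and '..'):

$ mkdir /tank/demo
$ ls -ld /tank/demo    # size 2: just '.' and '..'
$ touch /tank/demo/a /tank/demo/b /tank/demo/c
$ ls -ld /tank/demo    # size 5: three entries plus '.' and '..'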
Re: [zfs-discuss] Multiple filesystem costs? Directory sizes?
Mario, For the reasons you mentioned, having a few different filesystems (on the order of 5-10, I'd guess) can be handy. Any time you want different behavior for different types of data, multiple filesystems are the way to go. For maximum directory size, it turns out that the practical limits aren't in ZFS -- they're in your favorite applications, like ls(1) and file browsers. ZFS won't mind if you put millions of files in a directory, but ls(1) will be painfully slow. Similarly, if you're using a mail program and you go to a big directory to grab an attachment... you'll wait and wait while it reads the first few bytes of every file in the directory to determine its type. Hope that helps, Jeff Mario Goebbels wrote: While setting up my new system, I'm wondering whether I should go with plain directories or use ZFS filesystems for specific stuff. About the cost of ZFS filesystems, I read on some Sun blog in the past about something like 64k kernel memory (or whatever) per active filesystem. What are however the additional costs? The reason I'm considering multiple filesystems is for instance easy ZFS backups and snapshots, but also tuning the recordsizes. Like storing lots of generic pictures from the web, smaller recordsizes may be appropriate to trim down the waste once the filesize surpasses the record size, as well as using large recordsizes for video files on a separate filesystem. Turning on and off compression and access times for performance reasons is another thing. Also, in this same message, I'd like to ask what sensible maximum directory sizes are. As in amount of files. Thanks. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
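A hedged sketch of the kind of split Mario describes (pool and dataset names made up):

# zfs create tank/pics
# zfs set recordsize=32k tank/pics     # smaller records for smallish image files
# zfs create tank/video
# zfs set compression=off tank/video   # video is already compressed; skip the CPU cost
# zfs create tank/docs
# zfs set compression=on tank/docs
# zfs set atime=off tank/docs          # properties are per-dataset; tune each independently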
Re: [zfs-discuss] ZFS stalling problem
Jesse, This isn't a stall -- it's just the natural rhythm of pushing out transaction groups. ZFS collects work (transactions) until either the transaction group is full (measured in terms of how much memory the system has), or five seconds elapse -- whichever comes first. Your data would seem to suggest that the read side isn't delivering data as fast as ZFS can write it. However, it's possible that there's some sort of 'breathing' effect that's hurting performance. One simple experiment you could try: patch txg_time to 1. That will cause ZFS to push transaction groups every second instead of the default of every 5 seconds. If this helps (or if it doesn't), please let us know. Thanks, Jeff Jesse DeFer wrote: Hello, I am having problems with ZFS stalling when writing; any help in troubleshooting would be appreciated. Every 5 seconds or so the write bandwidth drops to zero, then picks up a few seconds later (see the zpool iostat at the bottom of this message). I am running SXDE, snv_55b. My test consists of copying a 1GB file (with cp) between two drives, one 80GB PATA, one 500GB SATA. The first drive is the system drive (UFS), the second is for data. I have configured the data drive with UFS and it does not exhibit the stalling problem, and it runs in almost half the time. I have tried many different ZFS settings as well: atime=off, compression=off, checksums=off, zil_disable=1, all to no effect. CPU jumps to about 25% system time during the stalls, and hovers around 5% when data is being transferred.

# zpool iostat 1
           capacity     operations    bandwidth
pool     used  avail   read  write   read  write
-----   -----  -----  -----  -----  -----  -----
tank     183M   464G      0     17  1.12K  1.93M
tank     183M   464G      0    457      0  57.2M
tank     183M   464G      0    445      0  55.7M
tank     183M   464G      0    405      0  50.7M
tank     366M   464G      0    226      0  4.97M
tank     366M   464G      0      0      0      0
tank     366M   464G      0      0      0      0
tank     366M   464G      0      0      0      0
tank     366M   464G      0    200      0  25.0M
tank     366M   464G      0    431      0  54.0M
tank     366M   464G      0    445      0  55.7M
tank     366M   464G      0    423      0  53.0M
tank     574M   463G      0    270      0  18.1M
tank     574M   463G      0      0      0      0
tank     574M   463G      0      0      0      0
tank     574M   463G      0      0      0      0
tank     574M   463G      0    164      0  20.5M
tank     574M   463G      0    504      0  63.1M
tank     574M   463G      0    405      0  50.7M
tank     753M   463G      0    404      0  42.6M
tank     753M   463G      0      0      0      0
tank     753M   463G      0      0      0      0
tank     753M   463G      0      0      0      0
tank     753M   463G      0    343      0  42.9M
tank     753M   463G      0    476      0  59.5M
tank     753M   463G      0    465      0  50.4M
tank     907M   463G      0     68      0   390K
tank     907M   463G      0      0      0      0
tank     907M   463G      0     11      0  1.40M
tank     907M   463G      0    451      0  56.4M
tank     907M   463G      0    492      0  61.5M
tank    1.01G   463G      0    139      0  7.94M
tank    1.01G   463G      0      0      0      0

Thanks, Jesse DeFer This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
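For reference, the txg_time patch Jeff suggests can be applied to a live kernel with mdb (the same knob comes up elsewhere on this list):

# echo txg_time/W1 | mdb -kw   # push transaction groups every 1 second instead of 5
# echo txg_time/D | mdb -k     # read the current value back to confirm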
Re: [zfs-discuss] FAULTED ZFS volume even though it is mirrored
However, I logged in this morning to discover that the ZFS volume could not be read. In addition, it appears to have marked all drives, mirrors, and the volume itself as 'corrupted'. One possibility: I've seen this happen when a system doesn't shut down cleanly after the last change to the pool configuration. In this case, what can happen is that the boot archive (an annoying implementation detail of the new boot architecture) can be out of date relative to your pool. In particular, the stale boot archive may contain an old version of /etc/zfs/zpool.cache, which confuses the initial pool open. The workaround for this is simple enough: export the pool and then import it. Assuming this works, you can fix the stupid boot archive by running 'bootadm update-archive'. Please let us know if this helps. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
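Spelled out, assuming the pool is named tank:

# zpool export tank
# zpool import tank          # re-reads the labels from disk, bypassing the stale cache
# bootadm update-archive     # then refresh the boot archive so it can't go stale again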
Re: [zfs-discuss] Does running redundancy with ZFS use as much disk space as doubling drives?
On Mon, Feb 26, 2007 at 01:53:17AM -0800, Tor wrote: [...] if using redundancy on ZDF The ZFS Document Format? ;-) uses less disk space than simply getting extra drives and doing identical copies, with periodic CRC checks of the source material to check the health. If you create a 2-disk mirror, then it is indeed simply two copies. But if you create, say, a 5-disk RAID-Z group, then you get 4 data disks' worth of space. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
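The arithmetic, assuming hypothetical 500GB drives:

2-disk mirror:  2 x 500GB = 1TB raw,   500GB usable  (50% of raw space)
5-disk RAID-Z:  5 x 500GB = 2.5TB raw, ~2TB usable   (80% of raw space; one disk's worth of parity)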
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Do you agree that there is a major tradeoff in building up a wad of transactions in memory? I don't think so. We trigger a transaction group commit when we have lots of dirty data, or 5 seconds elapse, whichever comes first. In other words, we don't let updates get stale. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Implementing fbarrier() on ZFS
That is interesting. Could this account for disproportionate kernel CPU usage for applications that perform I/O one byte at a time, as compared to other filesystems? (Never mind that the application shouldn't do that to begin with.) No, this is entirely a matter of CPU efficiency in the current code. There are two issues; we know what they are; and we're fixing them. The first is that as we translate from znode to dnode, we throw away information along the way -- we go from znode to object number (fast), but then we have to do an object lookup to get from object number to dnode (slow, by comparison -- or more to the point, slow relative to the cost of writing a single byte). But this is just stupid, since we already have a dnode pointer sitting right there in the znode. We just need to fix our internal interfaces to expose it. The second problem is that we're not very fast at partial-block updates. Again, this is entirely a matter of code efficiency, not anything fundamental. I still would love to see something like fbarrier() defined by some standard (de facto or otherwise) to make the distinction between ordered writes and guaranteed persistence more easily exploited in the general case for applications, and encourage filesystems/storage systems to optimize for that case (i.e., not have fbarrier() simply be fsync()). Totally agree. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS vs NFS vs array caches, revisited
How did the ZFS striped on 7 slices of an FC-SATA LUN via NFS work 146 times faster than the ZFS on 1 slice of the same LUN via NFS??? Without knowing more I can only guess, but most likely it's a simple matter of working set. Suppose the benchmark in question has a 4G working set, and suppose that each LUN is fronted by a 1G cache. With a single LUN, only 1/4 of your working set fits in cache, so you're doing a fair amount of actual disk I/O. With 7 LUNs, you've got 7G of cache, so the entire benchmark fits in cache -- no disk I/O. The factor of 100x is what tells me this is almost certainly a working-set effect. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs corruption -- odd inum?
The object number is in hex. 21e282 hex is 2220674 decimal -- give that a whirl. This is all better now thanks to some recent work by Eric Kustarz: 6410433 'zpool status -v' would be more useful with filenames This was integrated into Nevada build 57. Jeff On Sat, Feb 10, 2007 at 05:18:05PM -0800, Joe Little wrote: So, I attempting to find the inode from the result of a zpool status -v: errors: The following persistent errors have been detected: DATASET OBJECT RANGE cc 21e382 lvl=0 blkid=0 Well, 21e282 appears to not be a valid number for find . -inum blah Any suggestions? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
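Following the pattern used elsewhere on this list for mapping object numbers to files, and assuming the 'cc' dataset is mounted at /cc, something like:

# find /cc -mount -inum 2220674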
Re: [zfs-discuss] zfs rewrite?
On Fri, Jan 26, 2007 at 10:57:19PM -0800, Frank Cusack wrote: On January 27, 2007 12:27:17 AM -0200 Toby Thain [EMAIL PROTECTED] wrote: On 26-Jan-07, at 11:34 PM, Pawel Jakub Dawidek wrote: 3. I created a file system with a huge amount of data, where most of the data is read-only. I changed my server from an intel to a sparc64 machine. Adaptive endianness only changes byte order to native on write, and because the file system is mostly read-only, it'll need to byteswap all the time. And here comes 'zfs rewrite'! Why would this help? (Obviously file data is never 'swapped'). Metadata (incl checksums?) still has to be byte-swapped. Or would atime updates also force a metadata update? Or am I totally mistaken. You're all correct. File data is never byte-swapped. Most metadata needs to be byte-swapped, but it's generally only 1-2% of your space. So the overhead shouldn't be significant, even if you never rewrite. An atime update will indeed cause a znode rewrite (unless you run with zfs set atime=off), so znodes will get rewritten by reads. The only other non-trivial metadata is the indirect blocks. All files up to 128k are stored in a single block: ZFS has variable blocksize from 512 bytes to 128k, so a 35k file consumes exactly 35k (not, say, 40k as it would with a fixed 8k blocksize). Single-block files have no indirect blocks, and hence no metadata other than the znode. So all that remains is the indirect blocks for files larger than 128k -- which is to say, not very much. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] File Space Allocation
Where can I find information on the file allocation methodology used by ZFS? You've inspired me to blog again: http://blogs.sun.com/bonwick/entry/zfs_block_allocation I'll describe the way we manage free space in the next post. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Snapshots impact on performance
Nice, this is pointing the finger more definitively. Next time could you try: dtrace -n '[EMAIL PROTECTED](20)] = count()}' -c 'sleep 5' (just send the last 10 or so stack traces) In the meantime I'll talk with our SPA experts and see if I can figure out how to fix this... By any chance is the pool fairly close to full? The fuller it gets, the harder it becomes to find long stretches of free space. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Corrupted LUN in RAIDZ group -- How to repair?
It looks like now the scrub has completed. Should I now clear these warnings? Yep. You survived the Unfortunate Event unscathed. You're golden. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
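For completeness, a hedged sketch of clearing them, assuming the pool is named tank and a build new enough to have the command:

# zpool clear tank       # reset the per-device error counters
# zpool status -v tank   # confirm the READ/WRITE/CKSUM columns are back to zero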
Re: [zfs-discuss] Re: system unresponsive after issuing a zpool attach
And it started replacement/resilvering... after a few minutes the system became unavailable. Reboot only gives me a few minutes, then resilvering makes the system unresponsive. Is there any workaround or patch for this problem??? Argh, sorry -- the problem is that we don't do aggressive enough scrub/resilver throttling. The effect is most pronounced on 32-bit or low-memory systems. We're working on it. One thing you might try is reducing txg_time to 1 second (the default is 5 seconds) by saying this: echo txg_time/W1 | mdb -kw. Let me describe what's happening, and why this may help. When we kick off a scrub (same code path as resilver, so I'll use the term generically), we traverse the entire block tree looking for blocks that need scrubbing. The tree traversal itself is single-threaded, but the work it generates is not -- each time we find a block that needs scrubbing, we schedule an async I/O to do it. As you've discovered, we can generate work faster than the I/O subsystem can process it. To avoid overloading the disks, we throttle I/O downstream, but we don't (yet) have an upstream throttle. If we discover blocks really fast, we can end up scheduling lots of I/O -- and sitting on lots of memory -- before the downstream throttle kicks in. The reason this relates to txg_time is that every time we sync a transaction group, we suspend the scrub thread and wait for all pending scrub I/Os to complete. This ensures that we won't asynchronously scrub a block that was freed and reallocated in a future txg; when coupled with the COW nature of ZFS, this allows us to run scrubs entirely independent of all filesystem-level structure (e.g. directories) and locking rules. This little trick makes the scrubbing algorithms *much* simpler. The key point is that each spa_sync() throttles the scrub to zero. By lowering txg_time from 5 to 1, you're cutting down the maximum number of pending scrub I/Os by roughly 5x. The unresponsiveness you're seeing is a threshold effect; I'm hoping that by running spa_sync() more often, we can get you below that threshold. Please let me know if this works for you. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance using slices vs. entire disk?
ZFS will try to enable the write cache if a whole disk is given. Additionally, keep in mind that the outer region of a disk is much faster. And it's portable. If you use whole disks, you can export the pool from one machine and import it on another. There's no way to export just one slice and leave the others behind... Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS performance using slices vs. entire disk?
is zfs any less efficient with just using a portion of a disk versus the entire disk? As others mentioned, if we're given a whole disk (i.e. no slice is specified) then we can safely enable the write cache. One other effect -- probably not huge -- is that the block placement algorithm is most optimal for an outer-to-inner track diameter ratio of about 2:1, which reflects typical platters. To quote the source: http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/metaslab.c#metaslab_weight

/*
 * Modern disks have uniform bit density and constant angular velocity.
 * Therefore, the outer recording zones are faster (higher bandwidth)
 * than the inner zones by the ratio of outer to inner track diameter,
 * which is typically around 2:1. We account for this by assigning
 * higher weight to lower metaslabs (multiplier ranging from 2x to 1x).
 * In effect, this means that we'll select the metaslab with the most
 * free bandwidth rather than simply the one with the most free space.
 */

But like I said, the effect isn't huge -- the high-order bit is that we have a preference for low LBAs. It's a second-order optimization to bias the allocation based on the maximum free bandwidth, which is currently based on an assumption about physical disk construction. In the future we'll do the smart thing and compute each metaslab's allocation bias based on its actual observed bandwidth. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
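A worked toy example of that weighting (numbers invented): take a metaslab near the outer edge that is 40% free (multiplier about 2.0) and one near the spindle that is 60% free (multiplier about 1.0). Weighted by bandwidth, the outer one wins: 0.40 x 2.0 = 0.80 versus 0.60 x 1.0 = 0.60 -- the most free bandwidth, not the most free space.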
Re: [zfs-discuss] ZFS performance using slices vs. entire disk?
With all of the talk about performance problems due to ZFS doing a sync to force the drives to commit to data being on disk, how much of a benefit is this - especially for NFS? It depends. For some drives it's literally 10x. Also, if I was lucky enough to have a working prestoserv card around, would ZFS be able to take advantage of that at all? I'm working on the general lack-of-NVRAM-in-servers problem. As for using presto, I don't think it'd be too hard. We've already structured the code so that allocating intent log blocks from a different set of vdevs would be straightforward. It's probably a week's work to define the new metaslab class, new vdev type, and modify the ZIL to use it. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] sharing a storage array
bonus questions: any idea when hot spares will make it to S10? good question :) It'll be in U3, and probably available as patches for U2 as well. The reason for U2 patches is Thumper (x4500), because we want ZFS on Thumper to have hot spares and double-parity RAID-Z from day one. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] persistent errors - which file?
I have a non-mirrored zfs file system which shows the status below. I saw the thread in the archives about working this out, but it looks like ZFS messages have changed. How do I find out what file(s) this is? [...] errors: The following persistent errors have been detected: DATASET OBJECT RANGE LOCAL 28905 3262251008-3262382080 I realize this is a bit lame, but currently the answer is: find /LOCAL -mount -inum 28905 And yes, we do indeed plan to automate this. ;-) Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss