[zfs-discuss] ZFS and Storage
Hi

Now that Solaris 10 06/06 is finally downloadable, I have some questions about ZFS.

- We have a big storage system supporting RAID5 and RAID1. At the moment we only use RAID5 (for non-Solaris systems as well). We are thinking about using ZFS on those LUNs instead of UFS. As ZFS on hardware RAID5 seems like overkill, an option would be to use RAID1 with RAID-Z. Then again, this is a waste of space, as it needs more disks due to the mirroring. Later on, we might be using asynchronous replication to another storage system over the SAN, which wastes even more space. It looks as if ZFS and storage virtualization, as of today, just don't work nicely together. What we would need is the ability to use JBODs.

- Does ZFS in the current version support LUN extension? With UFS, we have to zero the VTOC and then adjust the new disk geometry. How does this look with ZFS?

- I've read the threads about ZFS and databases. Still, I'm not 100% convinced about read performance. Doesn't the fragmentation of large database files (because of the concept of COW) impact read performance?

- Does anybody have experience with database cloning using the ZFS clone mechanism? What factors influence performance when running the cloned database in parallel? I really like the idea of keeping all the needed database files together, to allow fast and consistent cloning.

Thanks

Mika

# mv Disclaimer.txt /dev/null
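(For concreteness, the layouts being weighed here map to zpool commands roughly as follows; the pool and device names are placeholders:

RAID-Z across JBOD disks:

    # zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0

ZFS mirroring, at the cost of half the raw space:

    # zpool create tank mirror c1t0d0 c1t1d0

)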
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
So if you have a single thread doing open/write/close of 8K files and getting 1.25MB/sec, that tells me you have something like a 6ms I/O latency, which looks reasonable too. What does iostat -x (client side) say for svc_t? 400ms seems high for the workload _and_ doesn't match my formula, so I don't like it ;-) A quick look at your script looks fine, though; something here just does not compute.

Why this formula (which applies to any single-threaded NFS client application working on small files)? Even if the open and write parts are infinitely fast, on close(2) NFS must ensure that the data is committed to disk. So, at a minimum, every close(2) must wait one I/O latency. During that wait, the single-threaded client application will not initiate the following open/write/close sequence. At best, you get one file output per I/O latency. The I/O latency here is the one seen by the client and includes the network component, but that should be small compared to the physical I/O.

-r
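To spell out the arithmetic behind that 6ms figure (using only the numbers quoted above):

    1.25 MB/s  /  8 KB per file   =  160 files/s
    1 s  /  160 files            ~=  6.25 ms per open/write/close cycle

i.e. one synchronous I/O of roughly 6ms per file closed, which is a perfectly ordinary disk latency.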
[zfs-discuss] zfs list -o usage info missing 'name'
Hi

Probably been reported a while back, but 'zfs list -o' does not list the rather useful (and obvious) 'name' property, and nor does the manpage at a quick read. snv_42.

# zfs list -o
missing argument for 'o' option
usage:
        list [-rH] [-o property[,property]...] [-t type[,type]...]
            [filesystem|volume|snapshot] ...

The following properties are supported:

        PROPERTY       EDIT  INHERIT  VALUES
        type           NO    NO       filesystem | volume | snapshot
        creation       NO    NO       date
        used           NO    NO       size
        available      NO    NO       size
        referenced     NO    NO       size
        compressratio  NO    NO       1.00x or higher if compressed
        mounted        NO    NO       yes | no | -
        origin         NO    NO       snapshot
        quota          YES   NO       size | none
        reservation    YES   NO       size | none
        volsize        YES   NO       size
        volblocksize   NO    NO       512 to 128k, power of 2
        recordsize     YES   YES      512 to 128k, power of 2
        mountpoint     YES   YES      path | legacy | none
        sharenfs       YES   YES      on | off | share(1M) options
        checksum       YES   YES      on | off | fletcher2 | fletcher4 | sha256
        compression    YES   YES      on | off | lzjb
        atime          YES   YES      on | off
        devices        YES   YES      on | off
        exec           YES   YES      on | off
        setuid         YES   YES      on | off
        readonly       YES   YES      on | off
        zoned          YES   YES      on | off
        snapdir        YES   YES      hidden | visible
        aclmode        YES   YES      discard | groupmask | passthrough
        aclinherit     YES   YES      discard | noallow | secure | passthrough

Gavin
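For what it's worth, 'name' is accepted as a column even though the usage message omits it; something like the following works (dataset names will differ, of course):

    # zfs list -o name,used,available,mountpoint
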
Re: [zfs-discuss] ZFS and Storage
About:

> I've read the threads about ZFS and databases. Still, I'm not 100%
> convinced about read performance. Doesn't the fragmentation of large
> database files (because of the concept of COW) impact read performance?

I do need to get back to this thread. The way I am currently looking at this is this: ZFS will perform great at the transactional component (say, the small 8K O_DSYNC writes) because the ZIL will aggregate them into fewer, larger I/Os and the block allocator will stream them to the disk surface. On the other hand, read streaming will require good prefetch code (under review) to get the read performance we want.

If the requirement balances random writes and read streaming, then ZFS should be right there with the best filesystems. If the critical requirement focuses exclusively on read-streaming a file that was written randomly, and, in addition, the number of spindles is limited, then that is not the sweet spot of ZFS. Read performance should still scale with the number of spindles. And, if the load can accommodate a reorder, to get top per-spindle read-streaming performance, a cp(1) of the file should do wonders for the layout.

-r
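Concretely, the reorder amounts to rewriting the file so the allocator lays it out sequentially; a sketch only, with placeholder paths, and assuming the pool has space for the temporary copy:

    # cp /tank/db/datafile /tank/db/datafile.tmp
    # mv /tank/db/datafile.tmp /tank/db/datafile

The copy goes through the normal COW allocator, so the new file's blocks come out largely sequential on disk.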
Re: [zfs-discuss] Re: Re: ZFS Wiki?
A lesson we learned with Solaris Zones applies here to ZFS. Accomplishing high-level goals, e.g. preparing an appropriate environment for application XYZ installation (Zones) or preparing an appropriate filesystem for application XYZ data (ZFS), is different than it was before Solaris 10. For Zones, a Sun BluePrint, the Solaris Containers Technology Architecture Guide, was written to begin to address this need. Fortunately, with ZFS it will be easier to determine the appropriate factors and settings than it was for earlier filesystems. However, documenting the lessons learned in a wiki would be very valuable.

Nathanael Burton wrote:
>> Just some random thoughts on this... One of the initial design criteria
>> of ZFS is that it's simple. If it's not, that was a bug... If we need
>> tutorials to use the zfs commands, has something missed the mark? If the
>> information that is needed to do the work is NOT in the man pages,
>> perhaps we could look to address that... Personally, I'd prefer to read
>> a manpage than scour the web for a tutorial that may or may not be
>> current. hm... man zfs_tutorial? :)
>>
>> Nathan.
>
> Tutorial might have been the wrong word. Man pages are good for finding a
> quick reference on specific commands, syntax, and basic functionality. I
> understand that ZFS was built around being simple but powerful. Some
> users/admins have trouble seeing the big picture and putting it all
> together. This is where I feel the power of a wiki, or any centralized
> documentation space related to ZFS, could be of benefit. There are also
> things that may not be explained in a man page, such as tying other
> applications in with ZFS: NetBackup and ZFS(?), ZFS and zones, ZFS and
> Oracle, etc. Most of those topics wouldn't be in a man page (except the
> zones one), but they are important topics that could be very useful.
>
> -Nate

--
Jeff VICTOR        Sun Microsystems        jeff.victor @ sun.com
OS Ambassador      Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--
Re: [zfs-discuss] Re: ZFS Wiki?
Mike Gerdts wrote:
> On 6/25/06, Nathan Kroenert [EMAIL PROTECTED] wrote:
> Now, looking forward a bit, where does the ZFS integration with zones
> documentation belong?

Some of it will appear in the next update to the Sun BluePrint Solaris Containers Technology Architecture Guide.

> How about real-world replication strategies with zfs send/receive,
> including appropriate utility scripts? Converting UFS root to ZFS root?
>
>> If the information that is needed to do the work is NOT in the man
>> pages, perhaps we could look to address that...
>
> All of the information is in man pages. Oftentimes, stringing man pages
> together into bigger concepts is too hard. Hence the general fear of man
> pages among UNIX newbies and some oldbies.

Should every sysadmin who is maintaining an Oracle (or MySQL, or ...) database be forced to go through the process of determining good combinations of ZFS settings? (There aren't many settings, but there are a few.) Will people learn that there are limitations to ZFS that are not documented? Wouldn't a wiki be useful as a central repository of such knowledge?

--
Jeff VICTOR        Sun Microsystems        jeff.victor @ sun.com
OS Ambassador      Sr. Technical Specialist
Solaris 10 Zones FAQ: http://www.opensolaris.org/os/community/zones/faq
--
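As a starting point for such a writeup, the basic replication step with send/receive looks like this (pool, dataset, and host names here are placeholders; see zfs(1M) for the details):

    # zfs snapshot tank/data@monday
    # zfs send tank/data@monday | ssh otherhost zfs receive backup/data

with later passes using 'zfs send -i' to ship only the incremental changes between snapshots.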
Re: [zfs-discuss] ZFS and Storage
On Jun 26, 2006, at 1:15 AM, Mika Borner wrote:
> We have a big storage system supporting RAID5 and RAID1. At the moment,
> we only use RAID5 (for non-Solaris systems as well). We are thinking
> about using ZFS on those LUNs instead of UFS. As ZFS on hardware RAID5
> seems like overkill, an option would be to use RAID1 with RAID-Z. Then
> again, this is a waste of space, as it needs more disks due to the
> mirroring. [...] What we need would be the feature to use JBODs.

If you've got hardware RAID-5, why not just run regular (non-raid) pools on top of the RAID-5? I wouldn't go back to JBOD. Hardware arrays offer a number of advantages over JBOD:

- disk microcode management
- optimized access to storage
- large write caches
- RAID computation can be done in specialized hardware
- SAN-based hardware products allow sharing of storage among multiple
  hosts. This allows storage to be utilized more effectively.

> Does ZFS in the current version support LUN extension? With UFS, we have
> to zero the VTOC, and then adjust the new disk geometry. How does it look
> with ZFS?

I don't understand what you're asking. What problem is solved by zeroing the VTOC?

> I've read the threads about ZFS and databases. Still I'm not 100%
> convinced about read performance. Doesn't the fragmentation of the large
> database files (because of the concept of COW) impact read performance?

This is discussed elsewhere in the zfs-discuss group.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-2773
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382        [EMAIL PROTECTED] (work)
Louisville, CO 80028-4382           [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've won." - Linus Torvalds
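A plain (non-redundant) pool over such LUNs would be created along these lines, with hypothetical device names where each cXtYd0 is a hardware RAID-5 LUN:

    # zpool create tank c4t0d0 c4t1d0

Note, though, that without ZFS-level redundancy the pool can detect corrupted blocks but not repair them, a point taken up later in this thread.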
Re: [zfs-discuss] Re: where has all my space gone? (with zfs mountroot + b38)
James C. McPherson wrote:
> James C. McPherson wrote:
>> Jeff Bonwick wrote:
>>> 6420204 root filesystem's delete queue is not running
>>>
>>> The workaround for this bug is to issue the following command:
>>>
>>>   # zfs set readonly=off pool/fs_name
>>>
>>> This will cause the delete queue to start up and should flush your
>>> queue. Thanks for the update.
> James, please let us know if this solves your problem.

Yes, I've tried that several times and it didn't work for me at all. One thing that worked a *little* bit was to set readonly=on, then go in with mdb -kw and set the drained flag on root_pool to 0, and then re-set readonly=off. But that only freed up about 2Gb.

Here's the next installment in the saga. I bfu'd to include Mark's recent putback, rebooted, re-ran the set readonly=off operation on the root pool and root filesystem, and waited. Nothing. Nada. Not a sausage.

Here's my root filesystem delete head:

::fsinfo ! head -2
VFSP         FS     MOUNT
fbcaa4e0     zfs    /

fbcaa4e0::print struct vfs vfs_data |::print struct zfsvfs z_delete_head
{
    z_delete_head.z_mutex = {
        _opaque = [ 0 ]
    }
    z_delete_head.z_cv = {
        _opaque = 0
    }
    z_delete_head.z_quiesce_cv = {
        _opaque = 0
    }
    z_delete_head.z_drained = 0x1
    z_delete_head.z_draining = 0
    z_delete_head.z_thread_target = 0
    z_delete_head.z_thread_count = 0
    z_delete_head.z_znode_count = 0x5ce4
    z_delete_head.z_znodes = {
        list_size = 0xc0
        list_offset = 0x10
        list_head = {
            list_next = 0x9232ded0
            list_prev = 0xfe820d2c16b0
        }
    }
}

I also went in with mdb -kw and set z_drained to 0, then re-set the readonly flag... still nothing. Pool usage is now up to ~93%, and a zdb run shows lots of leaked space too:

[snip bazillions of entries re leakage]

block traversal size 273838116352 != alloc 274123164672 (leaked 285048320)

        bp count:        5392224
        bp logical:      454964635136    avg: 84374
        bp physical:     272756334592    avg: 50583    compression: 1.67
        bp allocated:    273838116352    avg: 50783    compression: 1.66
        SPA allocated:   274123164672    used: 91.83%

Blocks  LSIZE   PSIZE   ASIZE   avg     comp   %Total  Type
     3  48.0K      8K   24.0K      8K   6.00     0.00  L1 deferred free
     5  44.0K   14.5K   37.0K   7.40K   3.03     0.00  L0 deferred free
     8  92.0K   22.5K   61.0K   7.62K   4.09     0.00  deferred free
     1    512     512      1K      1K   1.00     0.00  object directory
     3  1.50K   1.50K   3.00K      1K   1.00     0.00  object array
     1    16K   1.50K   3.00K   3.00K  10.67     0.00  packed nvlist
     -      -       -       -       -      -        -  packed nvlist size
     1    16K      1K   3.00K   3.00K  16.00     0.00  L1 bplist
     1    16K     16K     32K     32K   1.00     0.00  L0 bplist
     2    32K   17.0K   35.0K   17.5K   1.88     0.00  bplist
     -      -       -       -       -      -        -  bplist header
     -      -       -       -       -      -        -  SPA space map header
   140  2.19M    364K   1.06M   7.79K   6.16     0.00  L1 SPA space map
 5.01K  20.1M   15.4M   30.7M   6.13K   1.31     0.01  L0 SPA space map
 5.15K  22.2M   15.7M   31.8M   6.17K   1.42     0.01  SPA space map
     1  28.0K   28.0K   28.0K   28.0K   1.00     0.00  ZIL intent log
     2    32K      2K   6.00K   3.00K  16.00     0.00  L6 DMU dnode
     2    32K      2K   6.00K   3.00K  16.00     0.00  L5 DMU dnode
     2    32K      2K   6.00K   3.00K  16.00     0.00  L4 DMU dnode
     2    32K   2.50K   7.50K   3.75K  12.80     0.00  L3 DMU dnode
    15   240K   50.5K    152K   10.1K   4.75     0.00  L2 DMU dnode
   594  9.28M   3.88M   11.6M   20.1K   2.39     0.00  L1 DMU dnode
 68.7K  1.07G    274M    549M   7.99K   4.00     0.21  L0 DMU dnode
 69.3K  1.08G    278M    561M   8.09K   3.98     0.21  DMU dnode
     3  3.00K   1.50K   4.50K   1.50K   2.00     0.00  DMU objset
     -      -       -       -       -      -        -  DSL directory
     3  1.50K   1.50K   3.00K      1K   1.00     0.00  DSL directory child map
     2     1K      1K      2K      1K   1.00     0.00  DSL dataset snap map
     5  64.5K   7.50K   15.0K   3.00K   8.60     0.00  DSL props
     -      -       -       -       -      -        -  DSL dataset
     -      -       -       -       -      -        -  ZFS znode
     -      -       -       -       -      -        -  ZFS ACL
 2.82K  45.1M   2.93M   5.85M   2.08K  15.41     0.00  L2 ZFS plain file
  564K  8.81G    612M   1.19G   2.17K  14.76     0.47  L1 ZFS plain file
 4.40M   414G    253G    253G   57.5K   1.63    99.21  L0 ZFS plain file
 4.95M   422G    254G    254G   51.4K   1.67    99.68  ZFS plain file
     1    16K      1K   3.00K   3.00K  16.00     0.00  L2 ZFS directory
Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS
Robert Milkowski wrote on 06/25/06 04:12:
> Hello Neil,
>
> Saturday, June 24, 2006, 3:46:34 PM, you wrote:
>
> NP> Chris, the data will be written twice on ZFS using NFS. This is
> NP> because NFS, on closing the file, internally uses fsync to cause the
> NP> writes to be committed. This causes the ZIL to immediately write the
> NP> data to the intent log. Later the data is also committed as part of
> NP> the pool's transaction group commit, at which point the intent log
> NP> blocks are freed.
> NP> It does seem inefficient to doubly write the data. In fact, for
> NP> blocks larger than zfs_immediate_write_sz (was 64K, but now 32K after
> NP> 6440499 was fixed) we write the data block and also an intent log
> NP> record with the block pointer. During txg commit we link this block
> NP> into the pool tree. By experimentation we found 32K to be the
> NP> (current) cutoff point. As the nfsds write at most 32K, they do not
> NP> benefit from this.
>
> Is 32KB easily tuned (mdb?)?

I'm not sure. NFS folk? I guess not, but perhaps.

> And why only for blocks larger than zfs_immediate_write_sz?

When the data is large enough (currently 32K) it's more efficient to write the block directly and additionally save the block pointer in a ZIL record. Otherwise it's more efficient to copy the data into a large log block, potentially along with other writes.

-- Neil
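In principle a kernel variable like this can be inspected and changed with mdb on a live system; this is a sketch only, and whether tuning zfs_immediate_write_sz this way is safe or useful is exactly the open question above:

    # echo 'zfs_immediate_write_sz/E' | mdb -k          (print the current value)
    # echo 'zfs_immediate_write_sz/Z 0x10000' | mdb -kw (write 64K; 8-byte write, 64-bit kernel assumed)
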
Re: [zfs-discuss] status question regarding sol10u2
I had the same problem.

On 6/26/06, Shannon Roddy [EMAIL PROTECTED] wrote:
> Noel Dellofano wrote:
>> Solaris 10u2 was released today. You can now download it from here:
>> http://www.sun.com/software/solaris/get.jsp
>
> Seems the download links are dead except for x86-64. No SPARC downloads.
Re: [zfs-discuss] status question regarding sol10u2
Noel Dellofano wrote:
>> Solaris 10u2 was released today. You can now download it from here:
>> http://www.sun.com/software/solaris/get.jsp
>
> Seems the download links are dead except for x86-64. No SPARC downloads.

Everything works perfectly.

$ ls -1
sol-10-u2-ga-sparc-lang-iso.zip
sol-10-u2-ga-sparc-lang.iso
sol-10-u2-ga-sparc-v1-iso.zip
sol-10-u2-ga-sparc-v1.iso
sol-10-u2-ga-sparc-v2-iso.zip
sol-10-u2-ga-sparc-v2.iso
sol-10-u2-ga-sparc-v3-iso.zip
sol-10-u2-ga-sparc-v3.iso
sol-10-u2-ga-sparc-v4-iso.zip
sol-10-u2-ga-sparc-v4.iso
sol-10-u2-ga-sparc-v5-iso.zip
sol-10-u2-ga-sparc-v5.iso

I have the x86 CD-ROMs also. A quick set of links is at the top of the page at Blastwave: www.blastwave.org

-- Dennis Clarke
[zfs-discuss] Solaris 10 6/06 now available for download
Shannon Roddy wrote:
> Noel Dellofano wrote:
>> Solaris 10u2 was released today. You can now download it from here:
>> http://www.sun.com/software/solaris/get.jsp
>
> Seems the download links are dead except for x86-64. No SPARC downloads.

There were some problems getting the links set up on the Sun download center, which should all be sorted out now. Have at it...
Re: [zfs-discuss] ZFS and Storage
Roch wrote:
> And, if the load can accommodate a reorder, to get top per-spindle
> read-streaming performance, a cp(1) of the file should do wonders for the
> layout.

But there may not be filesystem space for double the data. Sounds like there is a need for a zfs-defragment-file utility, perhaps? Or, if you want to be politically cagey about the naming choice, perhaps zfs-seq-read-optimize-file? :-)
[zfs-discuss] Re: ZFS and Storage
> If you've got hardware raid-5, why not just run regular (non-raid) pools
> on top of the raid-5? I wouldn't go back to JBOD. Hardware arrays offer a
> number of advantages over JBOD:
> - disk microcode management
> - optimized access to storage
> - large write caches
> - RAID computation can be done in specialized hardware
> - SAN-based hardware products allow sharing of storage among multiple
>   hosts. This allows storage to be utilized more effectively.

I'm a little confused by the first poster's message as well, but you lose some benefits of ZFS if you don't create your pools with either RAID1 or RAID-Z, such as self-healing of corrupted data. ZFS will still detect corruption through its checksums, but without ZFS-level redundancy it cannot repair it. The array isn't going to detect that corruption at all, because all it knows about are blocks.

-Nate
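This difference shows up directly with a scrub, which makes ZFS re-verify every checksum ('tank' is a placeholder pool name):

    # zpool scrub tank
    # zpool status -v tank

On a mirror or RAID-Z pool, any checksum errors found are repaired from a good copy; on a plain pool over array LUNs they can only be counted and reported.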
Re: [zfs-discuss] ZFS and Storage
Eric Schrock wrote:
> On Mon, Jun 26, 2006 at 05:26:24PM -0600, Gregory Shaw wrote:
>> You're using hardware raid. The hardware raid controller will rebuild
>> the volume in the event of a single drive failure. You'd need to keep on
>> top of it, but that's a given in the case of either hardware or software
>> raid.
>
> True for total drive failure, but there are more failure modes than that.
> With hardware RAID, there is no way for the RAID controller to know which
> block was bad, and therefore it cannot repair the block. With RAID-Z, we
> have the integrated checksum and can do combinatorial analysis to know
> not only which drive was bad, but also what the data _should_ be, and can
> repair it to prevent more corruption in the future.

Keep in mind that each disk data block is accompanied by a pretty long error-correction code (ECC), which allows for (a) verification of data integrity and (b) repair of lost/misread bits (typically up to about 10% of the block data). Therefore, in the case of single-block errors, there are several possible situations:

- non-recoverable errors: the number of correct bits in the combined data + ECC is insufficient. Such errors are visible to the RAID controller; the controller can use a redundant copy of the data and perform the repair.

- recoverable errors: some bits can't be read correctly but can be reconstructed using the ECC. These errors are not directly visible to either the RAID controller or ZFS. However, the disks keep a count of recoverable errors, so disk scrubbers can identify disk areas with rotten blocks and force block relocation.

- silent data corruption: it can happen in memory before the data is written to disk, it can occur in the disk cache, or it can be caused by a bug in disk firmware. Here the disk controller can't do anything, and the end-to-end checksums that ZFS offers are the only solution.

-- Olaf
Re: [zfs-discuss] ZFS and Storage
Gregory Shaw wrote:
> On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote:
>> How would ZFS self-heal in this case?
>
> You're using hardware raid. The hardware raid controller will rebuild the
> volume in the event of a single drive failure. You'd need to keep on top
> of it, but that's a given in the case of either hardware or software
> raid. If you've got requirements for surviving an array failure, the
> recommended solution in that case is to mirror between volumes on
> multiple arrays. I've always liked software raid (mirroring) in that
> case, as no manual intervention is needed in the event of an array
> failure. Mirroring between discrete arrays is usually reserved for
> mission-critical applications that cost thousands of dollars per hour in
> downtime.

In other words, it won't. You've spent the disk space, but because you're mirroring in the wrong place (the raid array), all ZFS can do is tell you that your data is gone. With luck, subsequent reads _might_ get the right data, but maybe not.

- Bart

--
Bart Smaalders            Solaris Kernel Performance
[EMAIL PROTECTED]         http://blogs.sun.com/barts
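The arrangement being argued for here, where ZFS itself owns the redundancy, would look something like this (hypothetical device names, one LUN taken from each of two arrays):

    # zpool create tank mirror c2t0d0 c3t0d0

Each side of the mirror then carries an independently checksummed copy, so ZFS can spot which device returned bad data and rewrite it from the good side, with no manual intervention even if a whole array fails.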
Re: [zfs-discuss] ZFS and Storage
Olaf Manczak wrote:
> Keep in mind that each disk data block is accompanied by a pretty long
> error-correction code (ECC), which allows for (a) verification of data
> integrity and (b) repair of lost/misread bits (typically up to about 10%
> of the block data).

AFAIK, typical disk ECC will correct 8 bytes. I'd love for it to be 10% (51 bytes). Do you have a pointer to such information?

> Therefore, in the case of single-block errors, there are several possible
> situations:
> - non-recoverable errors [...]
> - recoverable errors [...]
> - silent data corruption: it can happen in memory before the data is
>   written to disk, it can occur in the disk cache, or it can be caused by
>   a bug in disk firmware. Here the disk controller can't do anything, and
>   the end-to-end checksums that ZFS offers are the only solution.

Another failure mode occurs when you use a format(1M)-like utility to scan and repair disks. For such utilities, if the data cannot be reconstructed, it is zero-filled. If there was real data stored there, then ZFS will detect it and the majority of other filesystems will not. On an array, one should not be able to readily access such utilities and cause such corrective actions, but I would not bet the farm on it. End-to-end error detection will always prevail.

-- richard