[zfs-discuss] NFS and ZFS, a fine combination
Just posted: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

Performance, Availability & Architecture Engineering
Roch Bourbonnais, Sun Microsystems, Icnc-Grenoble
Senior Performance Analyst
180, Avenue De L'Europe, 38330 Montbonnot Saint Martin, France
http://icncweb.france/~rbourbon   http://blogs.sun.com/roch
[EMAIL PROTECTED]   (+33).4.76.18.83.20
Re: [zfs-discuss] NFS and ZFS, a fine combination
Roch - PAE wrote:
> Just posted: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

Nice article. Now what about when we do this with more than one disk and compare
UFS/SVM or VxFS/VxVM with ZFS as the back end - all with JBOD storage? How then
does ZFS compare as an NFS server?

-- Darren J Moffat
Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
> > Is this expected behavior? Assuming concurrent reads (not synchronous and
> > sequential) I would naively expect an n-disk raidz2 pool to have a
> > normalized performance of n for small reads.
>
> q.v. http://www.opensolaris.org/jive/thread.jspa?threadID=20942&tstart=0
> where such behavior in a hardware RAID array led to corruption which was
> detected by ZFS. No free lunch today, either.
>  -- richard

I appreciate the advantage of checksumming, believe me. Though I don't see why
this is directly related to the small-read problem, other than that the
implementation happens to work that way.

Is there some fundamental reason why one could not (though I understand one
*would* not) keep a checksum on a per-disk basis, so that in the normal case one
really could read from just one disk for a small read? I realize it is not
enough for a block to be self-consistent, but theoretically couldn't the block
which points to the block in question contain multiple checksums for the various
subsets on different disks, rather than just the one checksum for the entire
block?

Not that I consider this a major issue; but since you pointed me to that article
in response to my statement above...

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
Re: [zfs-discuss] NFS and ZFS, a fine combination
> http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine

So just to confirm: disabling the ZIL *ONLY* breaks the semantics of fsync() and
synchronous writes from the application perspective; it will do *NOTHING* to
lessen the correctness guarantee of ZFS itself, including in the case of a power
outage?

This makes it more reasonable to actually disable the ZIL. But still, personally
I would like to be able to tell the NFS server to simply not be standards
compliant, so that I can keep the correct semantics on the lower layer (ZFS) and
disable the behavior at the level where I actually want it disabled (the NFS
server).

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
[zfs-discuss] Distributed FS
Hi,

Is ZFS comparable to PVFS2? Could it also be used as a distributed filesystem at
the moment, or are there any plans for this in the future?

Thanks and best regards,
Ivan
Re: [zfs-discuss] Limit ZFS Memory Utilization
Sanjeev,

Could you point me in the right direction as to how to convert the following GCC
compile flags to Studio 11 compile flags? Any help is greatly appreciated. We're
trying to recompile MySQL to give a stacktrace and core file to track down
exactly why it's crashing...hopefully it will illuminate whether memory truly is
the issue. Thank you very much in advance!

  -felide-constructors -fno-exceptions -fno-rtti

Best Regards,
Jason

On 1/7/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote:
> Jason,
>
> There is no documented way of limiting the memory consumption. The ARC
> section of ZFS tries to adapt to the memory pressure of the system.
> However, in your case it is probably not adapting quickly enough.
>
> One way of limiting the memory consumption would be to limit arc.c_max.
> arc.c_max is set to 3/4 of the available memory (or to 1GB less than the
> available memory) when ZFS is loaded (arc_init()). You should be able to
> change the value of arc.c_max through mdb and set it to the value you
> want. Exercise caution while setting it, and make sure you don't have
> active zpools during this operation.
>
> Thanks and regards,
> Sanjeev.
>
> Jason J. W. Williams wrote:
> > Hello,
> >
> > Is there a way to set a max memory utilization for ZFS? We're trying to
> > debug an issue where ZFS is sucking all the RAM out of the box, and we
> > think it's crashing MySQL as a result. Will ZFS reduce its cache size if
> > it feels memory pressure? Any help is greatly appreciated.
> >
> > Best Regards,
> > Jason
>
> --
> Solaris Revenue Products Engineering,
> India Engineering Center, Sun Microsystems India Pvt Ltd.
> Tel: x27521 +91 80 669 27521
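[For reference, the mdb route Sanjeev describes looks roughly like the session
below on builds of that era. The printed address and the current value are
illustrative, and the new value (0x40000000, i.e. 1GB) is only an example; this
pokes a live kernel, so treat it as a workaround to experiment with rather than
a supported interface.]

  # mdb -kw
  > arc::print -a c_max
  ffffffffc00b3260 c_max = 0xbd1e6000
  > ffffffffc00b3260/Z 0x40000000
  > $q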
[zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
I have been looking at the ZFS source trying to get up to speed on the
internals. One thing that interests me about the filesystem is what appears to
be low-hanging fruit for block-squishing CAS (Content Addressable Storage). I
think that in addition to lzjb compression, squishing blocks that contain the
same data would buy a lot of space for administrators working in many common
workflows. I am writing to see if I can get some feedback from people that know
the code better than I do -- are there any gotchas in my logic?

Assumptions:
- SHA256 hash used (Fletcher2/4 have too many collisions; SHA256 is 2^128 if I
  remember correctly).
- The SHA256 hash is taken on the data portion of the block as it exists on
  disk; the metadata structure is hashed separately.
- In the current metadata structure, there is a reserved bit portion to be used
  in the future.

Description of change:

Creates: The filesystem goes through its normal process of writing a block and
creating the checksum. Before the step where the metadata tree is pushed, the
checksum is checked against a global checksum tree to see if there is any match.
If a match exists: insert a metadata placeholder for the block that references
the already-existing block on disk, increment a number_of_links counter on the
metadata blocks to keep track of the pointers pointing to this block, and free
up the new block that was written and checksummed so it can be used in the
future. Else, if there is no match: update the checksum tree with the new
checksum and continue as normal.

Deletes: normal process, except verify that the number_of_links count is
lowered, and if it is non-zero then do not free the block. Clean up the checksum
tree as needed.

What this requires:
- A new flag in metadata that can tag the block as a CAS block.
- A checksum tree that allows easy, fast lookup of checksum keys.
- A counter in the metadata or hash tree that tracks links back to blocks.
- Some additions to the userland apps to push the config/enable modes.

Does this seem feasible? Are there any blocking points that I am missing or
unaware of? I am just posting this for discussion; it seems very interesting to
me.

-Wade
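[Not the block-level scheme being proposed, but a quick way to get a feel for
how much duplicate data a dataset carries is to hash at whole-file granularity
with the Solaris digest(1) utility. This understates what a block-level CAS
scheme could find, and the path argument is just a placeholder.]

  #!/bin/sh
  # Count files whose entire contents duplicate some other file's contents.
  # Whole-file granularity only; a block-level scheme would also catch
  # partially overlapping files.
  find "${1:-.}" -type f -exec digest -v -a sha256 {} \; |
      awk '{ seen[$NF]++ }
           END {
               for (h in seen) if (seen[h] > 1) dups += seen[h] - 1
               print dups + 0, "files duplicate content stored elsewhere"
           }'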
Re: [zfs-discuss] NFS and ZFS, a fine combination
On Mon, Jan 08, 2007 at 03:47:31PM +0100, Peter Schuller wrote:
> > http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>
> So just to confirm: disabling the ZIL *ONLY* breaks the semantics of fsync()
> and synchronous writes from the application perspective; it will do *NOTHING*
> to lessen the correctness guarantee of ZFS itself, including in the case of a
> power outage?

That is correct. ZFS, with or without the ZIL, will *always* maintain consistent
on-disk state and will *always* preserve the ordering of events on-disk. That
is, if an application makes two changes to the filesystem, first A, then B, ZFS
will *never* show B on-disk without also showing A.

> This makes it more reasonable to actually disable the ZIL. But still,
> personally I would like to be able to tell the NFS server to simply not be
> standards compliant, so that I can keep the correct semantics on the lower
> layer (ZFS), and disable the behavior at the level where I actually want it
> disabled (the NFS server).

This would be nice, simply to make it easier to do apples-to-apples comparisons
with other NFS server implementations that don't honor the correct semantics
(Linux, I'm looking at you).

--Bill
Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Peter Schuller wrote:
> > > Is this expected behavior? Assuming concurrent reads (not synchronous and
> > > sequential) I would naively expect an n-disk raidz2 pool to have a
> > > normalized performance of n for small reads.
> >
> > q.v. http://www.opensolaris.org/jive/thread.jspa?threadID=20942&tstart=0
> > where such behavior in a hardware RAID array led to corruption which was
> > detected by ZFS. No free lunch today, either.
> >  -- richard
>
> I appreciate the advantage of checksumming, believe me. Though I don't see
> why this is directly related to the small-read problem, other than that the
> implementation happens to work that way.
>
> Is there some fundamental reason why one could not (though I understand one
> *would* not) keep a checksum on a per-disk basis, so that in the normal case
> one really could read from just one disk for a small read? I realize it is
> not enough for a block to be self-consistent, but theoretically couldn't the
> block which points to the block in question contain multiple checksums for
> the various subsets on different disks, rather than just the one checksum for
> the entire block?

Then you would need to keep checksums for each physical block, which is not part
of the on-disk spec. It is not clear to me that this would be a net win, because
you would need that checksum to be physically placed on another vdev, which
implies that you still couldn't just read a single block and be happy. Note,
there are lots of different possibilities here; ZFS implements the end-to-end
checksum, which would not be replaced by a lower-level checksum anyway.

 -- richard
Re: [zfs-discuss] NFS and ZFS, a fine combination
> On Mon, Jan 08, 2007 at 03:47:31PM +0100, Peter Schuller wrote:
> > > http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
> >
> > So just to confirm: disabling the ZIL *ONLY* breaks the semantics of
> > fsync() and synchronous writes from the application perspective; it will do
> > *NOTHING* to lessen the correctness guarantee of ZFS itself, including in
> > the case of a power outage?
>
> That is correct. ZFS, with or without the ZIL, will *always* maintain
> consistent on-disk state and will *always* preserve the ordering of events
> on-disk. That is, if an application makes two changes to the filesystem,
> first A, then B, ZFS will *never* show B on-disk without also showing A.

So then, this begs the question: why do I want this ZIL animal at all?

> > This makes it more reasonable to actually disable the ZIL. But still,
> > personally I would like to be able to tell the NFS server to simply not be
> > standards compliant, so that I can keep the correct semantics on the lower
> > layer (ZFS), and disable the behavior at the level where I actually want it
> > disabled (the NFS server).
>
> This would be nice, simply to make it easier to do apples-to-apples
> comparisons with other NFS server implementations that don't honor the
> correct semantics (Linux, I'm looking at you).

Is that a glare, or a leer, or a sneer? :-)

dc
Re: [zfs-discuss] NFS and ZFS, a fine combination
Peter Schuller wrote:
> > http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>
> So just to confirm: disabling the ZIL *ONLY* breaks the semantics of fsync()
> and synchronous writes from the application perspective; it will do *NOTHING*
> to lessen the correctness guarantee of ZFS itself, including in the case of a
> power outage?

See this blog that Roch pointed to:
http://blogs.sun.com/erickustarz/entry/zil_disable

See the sentence: "Note: disabling the ZIL does NOT compromise filesystem
integrity." Disabling the ZIL does NOT cause corruption in ZFS.

> This makes it more reasonable to actually disable the ZIL. But still,
> personally I would like to be able to tell the NFS server to simply not be
> standards compliant, so that I can keep the correct semantics on the lower
> layer (ZFS), and disable the behavior at the level where I actually want it
> disabled (the NFS server).

This discussion belongs on the nfs-discuss alias.

eric
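[For completeness, the mechanism that blog entry discusses is the zil_disable
tunable present on builds of this vintage. A sketch of the two usual ways people
set it for benchmarking experiments only; the exact behavior is
release-dependent, and the setting is generally picked up when a dataset is
mounted, so a remount or reboot is needed for it to take effect.]

  # one-off, on a live system, then remount the filesystems of interest:
  echo 'zil_disable/W0t1' | mdb -kw

  # or persistently, by adding this line to /etc/system and rebooting:
  set zfs:zil_disable = 1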
Re: [zfs-discuss] NFS and ZFS, a fine combination
Hans-Juergen Schnitzer writes:
> Roch - PAE wrote:
> > Just posted: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>
> What role does network latency play? If I understand you right, even a
> low-latency network, e.g. Infiniband, would not increase performance
> substantially, since the main bottleneck is that the NFS server always has to
> write data to stable storage. Is that correct?
>
> Hans Schnitzer

For this load, network latency plays a role as long as it is of the same order
of magnitude as the I/O latency. Once network latency gets much smaller than I/O
latency, network latency becomes pretty much irrelevant. At times both are of
the same order of magnitude and both must be taken into account. So if your
storage is NVRAM-based, or is far away, then network latency may still be very
much at play.

-r
Re: [zfs-discuss] NFS and ZFS, a fine combination
Hans-Juergen Schnitzer wrote:
> Roch - PAE wrote:
> > Just posted: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
>
> What role does network latency play? If I understand you right, even a
> low-latency network, e.g. Infiniband, would not increase performance
> substantially, since the main bottleneck is that the NFS server always has to
> write data to stable storage. Is that correct?

Correct. You can essentially simulate the NFS semantics by doing an fsync after
every file creation and before every close on a local tar extraction.

eric

> Hans Schnitzer
Re: [zfs-discuss] Limit ZFS Memory Utilization
On 8-Jan-07, at 11:54 AM, Jason J. W. Williams wrote:
> ...We're trying to recompile MySQL to give a stacktrace and core file to
> track down exactly why it's crashing...hopefully it will illuminate whether
> memory truly is the issue.

If you're using the Enterprise release, can't you get MySQL's assistance with
this?

--Toby
[zfs-discuss] hard-hang on snapshot rename
[Initial version of this message originally sent to zfs-interest by mistake.
Sorry if this appears anywhere as a duplicate.]

I was noodling around with creating a backup script for my home system, and I
ran into a problem that I'm having a little trouble diagnosing. Has anyone seen
anything like this or have any debug advice?

I did a zfs create -r to set a snapshot on all of the members of a given pool.
Later, for reasons that are probably obscure, I wanted to rename that snapshot.
There's no zfs rename -r function, so I tried to write a crude one on my own:

  zfs list -rHo name -t filesystem pool | while read name; do
      zfs rename [EMAIL PROTECTED] [EMAIL PROTECTED]
  done

The results were disappointing. The system was extremely busy for a moment and
then went completely catatonic. Most network traffic appeared to stop, though I
_think_ network driver interrupts were still working. The keyboard and mouse
(traditional PS/2 types; not USB) went dead -- not even keyboard lights were
working (nothing from Caps Lock). The disk light stopped flashing and went dark.
The CPU temperature started to climb (as measured by an external sensor). No
messages were written to /var/adm/messages or dmesg on reboot. The system turned
into an increasingly warm brick.

As all of my inputs to the system were gone, I really had no good way
immediately available to debug the problem. Thinking this was just a fluke or
perhaps something induced by hardware, I shut everything down, cooled off, and
tried again. Three times. The same thing happened each time.

System details:

- snv_55
- Tyan 2885 motherboard with 4GB RAM (four 1GB modules) and one Opteron 246
  (model 5 step 8).
- AMI BIOS version 080010, dated 06/14/2005. No tweaks applied, system is
  always on; no power management.
- Silicon Image 3114 SATA controller configured for legacy (not RAID) mode.
- Three SATA disks in the system, no IDE as they've gone to the great
  bit-bucket in the sky. The SATA drives are one WDC WD740GD-32F (not part of
  this ZFS pool), and a pair of ST3250623NS.
- The two Seagate drives are partitioned like this:

  Part      Tag    Flag     Cylinders         Size            Blocks
    0       root    wm       3 -   655        5.00GB    (653/0/0)     10490445
    1       swap    wm     656 -   916        2.00GB    (261/0/0)      4192965
    2     backup    wu       0 - 30397      232.86GB    (30398/0/0)  488343870
    3   reserved    wm     917 -   917        7.84MB    (1/0/0)          16065
    4 unassigned    wu       0                    0     (0/0/0)              0
    5 unassigned    wu       0                    0     (0/0/0)              0
    6 unassigned    wu       0                    0     (0/0/0)              0
    7       home    wm     918 - 30397      225.83GB    (29480/0/0)  473596200
    8       boot    wu       0 -     0        7.84MB    (1/0/0)          16065
    9 alternates    wm       1 -     2       15.69MB    (2/0/0)          32130

- For both disks: slice 0 is for an SVM mirrored root, slice 1 has swap,
  slice 3 has the SVM metadata, and slice 7 is in the ZFS pool named "pool" as
  a mirror. No, I'm not using whole-disk or EFI.
- zpool status:

    pool: pool
   state: ONLINE
   scrub: none requested
  config:

          NAME        STATE     READ WRITE CKSUM
          pool        ONLINE       0     0     0
            mirror    ONLINE       0     0     0
              c4d0s7  ONLINE       0     0     0
              c4d1s7  ONLINE       0     0     0

- 'zfs list -rt filesystem pool | wc -l' says 37.
- 'iostat -E' doesn't show any errors of any kind on the drives.
- I read through CR 6421427, but that seems to be SPARC-only.

Next step will probably be to set the 'snooping' flag and maybe hack the bge
driver to do an abort_sequence_enter() call on a magic packet so that I can
wrest control back. Before I do something that drastic, does anyone else have
ideas?
--
James Carlson, Solaris Networking              [EMAIL PROTECTED]
Sun Microsystems / 1 Network Drive         71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
[zfs-discuss] Puzzling ZFS behavior with COMPRESS option
Our setup:

- E2900 (24 x 96); Solaris 10 Update 2 (aka 06/06)
- 2 x 2Gbps FC HBA
- EMC DMX storage
- 50 x 64GB LUNs configured in 1 ZFS pool
- Many filesystems created with compression enabled; specifically, I've one
  that is 768GB

I'm observing the following puzzling behavior:

- We are currently creating a large (1.4TB) and sparse dataset; most of the
  dataset contains repeating blanks (default/standard SAS dataset behavior).
- 'ls -l' reports the file size as 1.4+TB and 'du -sk' reports the actual
  on-disk usage at around 65GB.
- My I/O on the system is pegged at 150+MB/s as reported by 'zpool iostat',
  and I've confirmed the same with iostat.

This is very confusing. ZFS is doing very good compression, as shown by the
ratio of on-disk size versus reported file size (1.4TB vs 65GB). Why on God's
green earth am I observing such high I/O when ZFS is indeed compressing? I can't
believe that the program is actually generating I/O at the rate of (150MB/s *
compressratio).

Any thoughts?
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Quick update: since my original post I've confirmed via DTrace (the rwtop script
in the DTraceToolkit) that the application is not generating 150MB/s *
compressratio of I/O. What, then, is causing this much I/O in our system?
[zfs-discuss] Adding disk to a RAID-Z?
I want to set up a ZFS server with RAID-Z. Right now I have 3 disks. In 6
months, I want to add a 4th drive and still have everything under RAID-Z without
a backup/wipe/restore scenario. Is this possible?

I've used NetApps in the past (1996 even!) and they do it. I think they're using
RAID4.
Re: [zfs-discuss] Puzzling ZFS behavior with COMPRESS option
Anantha N. Srirama wrote on 01/08/07 13:04:
> Our setup:
> - E2900 (24 x 96); Solaris 10 Update 2 (aka 06/06)
> - 2 x 2Gbps FC HBA
> - EMC DMX storage
> - 50 x 64GB LUNs configured in 1 ZFS pool
> - Many filesystems created with compression enabled; specifically, I've one
>   that is 768GB
>
> I'm observing the following puzzling behavior:
> - We are currently creating a large (1.4TB) and sparse dataset; most of the
>   dataset contains repeating blanks (default/standard SAS dataset behavior).
> - 'ls -l' reports the file size as 1.4+TB and 'du -sk' reports the actual
>   on-disk usage at around 65GB.
> - My I/O on the system is pegged at 150+MB/s as reported by 'zpool iostat',
>   and I've confirmed the same with iostat.
>
> This is very confusing. ZFS is doing very good compression, as shown by the
> ratio of on-disk size versus reported file size (1.4TB vs 65GB). Why on God's
> green earth am I observing such high I/O when ZFS is indeed compressing? I
> can't believe that the program is actually generating I/O at the rate of
> (150MB/s * compressratio).
>
> Any thoughts?

One possibility is that the data is written synchronously (uses O_DSYNC, fsync,
etc), and so the ZFS Intent Log (ZIL) will write that uncompressed data to
stable storage in case of a crash/power failure before the txg is committed.

Neil.
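[One way to check Neil's hypothesis is to watch for synchronous-write activity
directly. A rough DTrace sketch follows; the fbt probe name is an implementation
detail that may differ between builds (hence the -Z flag), and the 10-second
window is arbitrary.]

  dtrace -Zqn '
      syscall::fsync:entry, syscall::fdsync:entry { @sys[execname] = count(); }
      fbt::zil_commit:entry                       { @zil[execname] = count(); }
      tick-10s
      {
          printa("fsync/fdsync calls by %s: %@d\n", @sys);
          printa("zil_commit calls by %s: %@d\n", @zil);
          exit(0);
      }'

[Explicit fsync() calls are only part of the picture; O_DSYNC writes also go
through the ZIL, which is why the zil_commit count is the more telling number.]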
Re: [zfs-discuss] hard-hang on snapshot rename
> I was noodling around with creating a backup script for my home system, and I
> ran into a problem that I'm having a little trouble diagnosing. Has anyone
> seen anything like this or have any debug advice?
>
> I did a zfs create -r to set a snapshot on all of the members of a given
> pool. Later, for reasons that are probably obscure, I wanted to rename that
> snapshot. There's no zfs rename -r function, so I tried to write a crude one
> on my own:
>
>   zfs list -rHo name -t filesystem pool | while read name; do
>       zfs rename [EMAIL PROTECTED] [EMAIL PROTECTED]
>   done

Do you mean "zfs snapshot -r fsname@foo" instead of the create?

Hmm, just to verify sanity, can you show the output of:

  zfs list -rHo name -t filesystem pool

and

  zfs list -rHo name -t filesystem pool | while read name; do
      echo zfs rename [EMAIL PROTECTED] [EMAIL PROTECTED]
  done

(note the echo inserted above)

> The results were disappointing. The system was extremely busy for a moment
> and then went completely catatonic. [...]
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
[EMAIL PROTECTED] wrote:
> I have been looking at the ZFS source trying to get up to speed on the
> internals. One thing that interests me about the filesystem is what appears
> to be low-hanging fruit for block-squishing CAS (Content Addressable
> Storage). I think that in addition to lzjb compression, squishing blocks that
> contain the same data would buy a lot of space for administrators working in
> many common workflows.
> [...]
> Does this seem feasible? Are there any blocking points that I am missing or
> unaware of? I am just posting this for discussion; it seems very interesting
> to me.

Note that you'd actually have to verify that the blocks were the same; you
cannot count on the hash function. If you didn't do this, anyone discovering a
collision could destroy the colliding blocks/files.

Val Henson wrote a paper on this topic; there's a copy here:

http://infohost.nmt.edu/~val/review/hash.pdf

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Limit ZFS Memory Utilization
We're not using the Enterprise release, but we are working with them. It looks
like MySQL is crashing due to lack of memory.

-J

On 1/8/07, Toby Thain [EMAIL PROTECTED] wrote:
> On 8-Jan-07, at 11:54 AM, Jason J. W. Williams wrote:
> > ...We're trying to recompile MySQL to give a stacktrace and core file to
> > track down exactly why it's crashing...hopefully it will illuminate whether
> > memory truly is the issue.
>
> If you're using the Enterprise release, can't you get MySQL's assistance
> with this?
>
> --Toby
Re: [zfs-discuss] hard-hang on snapshot rename
[EMAIL PROTECTED] writes:
> > I was noodling around with creating a backup script for my home system,
> > and I ran into a problem that I'm having a little trouble diagnosing. Has
> > anyone seen anything like this or have any debug advice?
> >
> > I did a zfs create -r to set a snapshot on all of the members of a given
> > pool. Later, for reasons that are probably obscure, I wanted to rename
> > that snapshot. There's no zfs rename -r function, so I tried to write a
> > crude one on my own:
>
> Do you mean "zfs snapshot -r fsname@foo" instead of the create?

Yes; sorry. A bit of a typo there.

> Hmm, just to verify sanity, can you show the output of:
>
>   zfs list -rHo name -t filesystem pool
>
> and
>
>   zfs list -rHo name -t filesystem pool | while read name; do
>       echo zfs rename [EMAIL PROTECTED] [EMAIL PROTECTED]
>   done
>
> (note the echo inserted above)

Sure, but it's not a shell problem. I should have mentioned that when I brought
the system back up, *most* of the renames had actually taken place, but not
*all* of them. I ended up with mostly [EMAIL PROTECTED], but with a handful of
stragglers near the end of the list [EMAIL PROTECTED].

The output looks a bit like this (not _all_ file systems shown, but
representative ones):

  pool
  pool/HTSData
  pool/apache
  pool/client
  pool/csw
  pool/home
  pool/home/benjamin
  pool/home/beth
  pool/home/carlsonj
  pool/home/ftp
  pool/laptop
  pool/local
  pool/music
  pool/photo
  pool/sys
  pool/sys/core
  pool/sys/dhcp
  pool/sys/mail
  pool/sys/named

And then:

  zfs rename [EMAIL PROTECTED] [EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/home/[EMAIL PROTECTED] pool/home/[EMAIL PROTECTED]
  zfs rename pool/home/[EMAIL PROTECTED] pool/home/[EMAIL PROTECTED]
  zfs rename pool/home/[EMAIL PROTECTED] pool/home/[EMAIL PROTECTED]
  zfs rename pool/home/[EMAIL PROTECTED] pool/home/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/[EMAIL PROTECTED] pool/[EMAIL PROTECTED]
  zfs rename pool/sys/[EMAIL PROTECTED] pool/sys/[EMAIL PROTECTED]
  zfs rename pool/sys/[EMAIL PROTECTED] pool/sys/[EMAIL PROTECTED]
  zfs rename pool/sys/[EMAIL PROTECTED] pool/sys/[EMAIL PROTECTED]
  zfs rename pool/sys/[EMAIL PROTECTED] pool/sys/[EMAIL PROTECTED]

It's not a matter of the shell script not working; it's a matter of something
inside the kernel (perhaps not even ZFS, but instead a driver related to SATA?)
experiencing vapor-lock. Other heavy load on the system, though, doesn't cause
this to happen. This one operation does cause the lock-up.

--
James Carlson, Solaris Networking              [EMAIL PROTECTED]
Sun Microsystems / 1 Network Drive         71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
> Note that you'd actually have to verify that the blocks were the same; you
> cannot count on the hash function. If you didn't do this, anyone discovering
> a collision could destroy the colliding blocks/files.

Given that nobody knows how to find SHA256 collisions, you'd of course need to
test this code with a weaker hash algorithm. (It would almost be worth it to
have the code panic in the event that a real SHA256 collision was found.)

- Bill
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
> > Does this seem feasible? Are there any blocking points that I am missing or
> > unaware of? I am just posting this for discussion; it seems very
> > interesting to me.
>
> Note that you'd actually have to verify that the blocks were the same; you
> cannot count on the hash function. If you didn't do this, anyone discovering
> a collision could destroy the colliding blocks/files.
>
> Val Henson wrote a paper on this topic; there's a copy here:
>
> http://infohost.nmt.edu/~val/review/hash.pdf
>
> - Bart

Sure, that makes sense. I do not see why that would be much of a problem beyond
"if the SHA256 hashes match, then do yet one more crypto hash of your choice to
verify they are indeed the same blocks" (fool me once, shame on me...); the hash
key should be able to be based on only the SHA256 marker then. If we do find a
natural collision, then a special code path (and an email to the NSA =) could be
in order.
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
[EMAIL PROTECTED] wrote:
> Sure, that makes sense. I do not see why that would be much of a problem
> beyond "if the SHA256 hashes match, then do yet one more crypto hash of your
> choice to verify they are indeed the same blocks" (fool me once, shame on
> me...); the hash key should be able to be based on only the SHA256 marker
> then. If we do find a natural collision, then a special code path (and an
> email to the NSA =) could be in order.

Is Honeycomb doing anything in this space?
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
Bill Sommerfeld [EMAIL PROTECTED] wrote on 01/08/2007 03:41:53 PM:
> > Note that you'd actually have to verify that the blocks were the same; you
> > cannot count on the hash function. If you didn't do this, anyone
> > discovering a collision could destroy the colliding blocks/files.
>
> Given that nobody knows how to find SHA256 collisions, you'd of course need
> to test this code with a weaker hash algorithm. (It would almost be worth it
> to have the code panic in the event that a real SHA256 collision was found.)
>
> - Bill

That reminds me, I had a few more questions about this:

1. If a filesystem was started with a fletcher hash and switched later to
sha256, is there a way to resilver the hashes to sha256 for blocks that existed
before the property was set?

2. Is there any way to get zdb to dump a list of blocks and their associated
hashes? (zdb seems to be lightly documented, and the source files for it require
a little more familiarity with ZFS internals than I have grokked yet.)
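[On question 2, purely as an assumption on my part since zdb is an undocumented,
build-specific debugging tool: on the builds I have poked at, raising the -d
verbosity dumps per-object block pointers, and each blkptr line carries its
cksum= value. Something along these lines is the place to experiment, not a
documented interface:]

  # dump object metadata for a dataset; more d's means more verbosity, and at
  # the higher levels the individual block pointers (with their checksums)
  # show up on the builds I've seen
  zdb -ddddd pool/fs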
Re: [zfs-discuss] hard-hang on snapshot rename
James Carlson [EMAIL PROTECTED] wrote on 01/08/2007 03:26:14 PM:
> Sure, but it's not a shell problem. I should have mentioned that when I
> brought the system back up, *most* of the renames had actually taken place,
> but not *all* of them. I ended up with mostly [EMAIL PROTECTED], but with a
> handful of stragglers near the end of the list [EMAIL PROTECTED].
> [...]
> It's not a matter of the shell script not working; it's a matter of something
> inside the kernel (perhaps not even ZFS, but instead a driver related to
> SATA?) experiencing vapor-lock. Other heavy load on the system, though,
> doesn't cause this to happen. This one operation does cause the lock-up.

Understood. Two things: does the rename loop hit any of the filesystems in
question, and does putting a "sort -r |" before the while make any difference?
Re: [zfs-discuss] hard-hang on snapshot rename
[EMAIL PROTECTED] wrote on 01/08/2007 04:06:46 PM:
> James Carlson [EMAIL PROTECTED] wrote on 01/08/2007 03:26:14 PM:
> > Sure, but it's not a shell problem. I should have mentioned that when I
> > brought the system back up, *most* of the renames had actually taken place,
> > but not *all* of them. I ended up with mostly [EMAIL PROTECTED], but with a
> > handful of stragglers near the end of the list [EMAIL PROTECTED].

Sorry, missed this -- ignore my first question.

> > [...]
> > It's not a matter of the shell script not working; it's a matter of
> > something inside the kernel (perhaps not even ZFS, but instead a driver
> > related to SATA?) experiencing vapor-lock. Other heavy load on the system,
> > though, doesn't cause this to happen. This one operation does cause the
> > lock-up.
>
> Understood. Two things: does the rename loop hit any of the filesystems in
> question, and does putting a "sort -r |" before the while make any
> difference?

The reason I ask is that I had a similar issue running through batch renames
(from epoch to human-readable names) of my snapshots. It seemed to cause a
system lock unless I did the batch depth-first (sort -r).
Re: [zfs-discuss] hard-hang on snapshot rename
[EMAIL PROTECTED] writes:
> > Other heavy load on the system, though, doesn't cause this to happen. This
> > one operation does cause the lock-up.
>
> Understood. Two things: does the rename loop hit any of the filesystems in
> question,

No; the loop you saw is essentially what I ran. (Other than that it was
level0new and level0 instead of foo and bar.) Thinking it was some locking
issue, I did try saving off the list in a file (on tmpfs), and then running it
through the while loop -- that produced the same result.

> and does putting a "sort -r |" before the while make any difference?

I'll give it a try tonight and see. It's a production system, so I have to wait
until all of the users are asleep or otherwise occupied by Two And A Half Men
reruns to try something hazardous like that.

--
James Carlson, Solaris Networking              [EMAIL PROTECTED]
Sun Microsystems / 1 Network Drive         71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677
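[For reference, the depth-first variant being suggested is just the original
loop with the listing reversed, so child filesystems get renamed before their
parents. The snapshot names below are placeholders for whatever the real pair
was.]

  zfs list -rHo name -t filesystem pool | sort -r | while read name; do
      # reverse-sorted, so pool/home/ftp is renamed before pool/home, etc.
      zfs rename "$name@oldsnap" "$name@newsnap"
  done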
Re: [zfs-discuss] Adding disk to a RAID-Z?
> I want to set up a ZFS server with RAID-Z. Right now I have 3 disks. In 6
> months, I want to add a 4th drive and still have everything under RAID-Z
> without a backup/wipe/restore scenario. Is this possible?

You can add additional storage to the same pool effortlessly, such that the pool
will be striped across two raidz vdevs. You cannot (AFAIK) expand the raidz
itself. End result is 9 disks, with 7 disks worth of effective storage capacity.

The ZFS administration guide contains examples of doing exactly this, except I
believe the examples use mirrors.

ZFS administration guide:
http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf

--
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org
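[A sketch of what Peter describes, with zpool(1M); the device names are
placeholders, and note that the added vdev is a whole new raidz group with its
own disks -- the single extra drive in the original question cannot be folded
into the existing 3-disk raidz.]

  # existing pool: one 3-disk raidz vdev
  zpool create tank raidz c1t0d0 c1t1d0 c1t2d0

  # later: grow the pool by adding a second raidz vdev; ZFS then dynamically
  # stripes new writes across both vdevs
  zpool add tank raidz c1t3d0 c1t4d0 c1t5d0

  zpool status tank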
Re: [zfs-discuss] Distributed FS
Ivan wrote:
> Hi,
>
> Is ZFS comparable to PVFS2? Could it also be used as a distributed filesystem
> at the moment, or are there any plans for this in the future?

I don't know anything at all about PVFS2, so I can't comment on that point. As
far as ZFS being used as a distributed file system, it cannot be used as such
today, but it is something we would like to develop. Do you have a specific use
case in mind for a distributed file system?

--
--Ed

Ed Gould, File System Architect, PSARC Chair
Sun Microsystems, Inc., Solaris Cluster
M/S UMPK17-201, 17 Network Circle, Menlo Park, CA 94025
[EMAIL PROTECTED]   +1.650.786.4937
Re: [zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Anantha N. Srirama wrote:
> Quick update: since my original post I've confirmed via DTrace (the rwtop
> script in the DTraceToolkit) that the application is not generating 150MB/s *
> compressratio of I/O. What, then, is causing this much I/O in our system?

Are you doing random IO? Appending or overwriting?

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
[zfs-discuss] Re: What SATA controllers are people using for ZFS?
For future reference for someone looking to build a ZFS storage server, the
server config I am now using is: Solaris 10 U3, two Supermicro AOC-SAT2-MV8
controllers, 12 Seagate 750GB drives, 2 Seagate 160GB drives, and an Asus P5M2
motherboard (I don't think these boards are yet for general sale; my vendor got
them from Asus). The P5M2 has one PCIe x16 slot, two PCI-X 133/100 64-bit slots,
and one PCI 33MHz 32-bit slot.

The vendor's IT staff claimed that even though Solaris loaded on the Supermicro
PDSME+ motherboard, there were frequent keyboard-detection issues. Asus claimed
they had tested the P5M2 with Solaris.
[zfs-discuss] Re: Distributed FS
Hi Ed,

pNFS (Parallel NFS) could benefit from using a 'distributed filesystem version'
of ZFS. By using pNFS, files could be striped across different NFS servers. Lisa
Week ([EMAIL PROTECTED]) told me that they would like to use ZFS in future pNFS
servers in Solaris.

Thanks and best regards,
Ivan