Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi list,

> If you're running solaris proper, you better mirror your ZIL log device.
> ... I plan to get to test this as well, won't be until late next week though.

Running OSOL nv130. Powered off the machine, removed the F20 and powered back on. The machine boots OK and comes up normally, with the following message in 'zpool status':

  pool: mypool
 state: FAULTED
status: An intent log record could not be read.
        Waiting for administrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
        or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME    STATE    READ WRITE CKSUM
        mypool  FAULTED     0     0     0  bad intent log

Nice! Running a later version of ZFS seems to lessen the need for ZIL-mirroring...

With kind regards,

Jeroen
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jeroen Roodhart
>
>> If you're running solaris proper, you better mirror your ZIL log device.
>> ... I plan to get to test this as well, won't be until late next week though.
>
> Running OSOL nv130. Power off the machine, removed the F20 and power back on. Machine boots OK and comes up normally [...]
>
> Nice! Running a later version of ZFS seems to lessen the need for ZIL-mirroring...

Yes, since zpool 19, which is not available in any version of solaris yet, and is not available in osol 2009.06 unless you update to developer builds. Since zpool 19, you have the ability to "zpool remove" log devices. And if a log device fails during operation, the system is supposed to fall back and just start using ZIL blocks from the main pool instead.

So the recommendation for zpool < 19 would be: *strongly* recommended. Mirror your log device if you care about using your pool. And the recommendation for zpool >= 19 would be: ... don't mirror your log device. If you have more than one, just add them both unmirrored.

I edited the ZFS Best Practices guide yesterday to reflect these changes.

I always have a shade of doubt about things that are "supposed to" do something. Later this week, I am building an OSOL machine, updating it, adding an unmirrored log device, starting a sync-write benchmark (to ensure the log device is heavily in use), and then I'm going to yank out the log device and see what happens.
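[A sync-write benchmark of the kind Edward describes can be sketched in a few lines of Python. This is a minimal illustration, not anything posted in the thread: the file path, block size, and iteration count are arbitrary assumptions. Each iteration calls fsync(), which is exactly the kind of operation the ZIL services, so running it with and without the log device attached shows whether the slog is being exercised.]

```python
import os
import time
import tempfile

def sync_write_benchmark(path, block_size=4096, count=1000):
    """Issue `count` synchronous writes and return the achieved sync IOPS.

    Every write is followed by fsync(), so each iteration forces a ZIL
    commit (to the slog if one is attached, otherwise to in-pool ZIL blocks).
    """
    buf = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    start = time.time()
    try:
        for _ in range(count):
            os.write(fd, buf)
            os.fsync(fd)  # blocks until the data is on stable storage
    finally:
        os.close(fd)
    elapsed = time.time() - start
    return count / elapsed

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        target = f.name
    try:
        print("sync IOPS: %.0f" % sync_write_benchmark(target, count=200))
    finally:
        os.unlink(target)
```

[Comparing the reported IOPS against the slog device's rated sync-write latency gives a rough sanity check that the log device is actually in the write path.]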
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 7 apr 2010, at 14.28, Edward Ned Harvey wrote:

> So the recommendation for zpool < 19 would be: *strongly* recommended. Mirror your log device if you care about using your pool. And the recommendation for zpool >= 19 would be: ... don't mirror your log device. If you have more than one, just add them both unmirrored.

Rather: ... >= 19 would be ... if you don't mind losing data written the ~30 seconds before the crash, you don't have to mirror your log device. For a file server, mail server, etc. etc., where things are stored and supposed to be available later, you almost certainly want redundancy on your slog too. (There may be file servers where this doesn't apply, but they are special cases that should not be mentioned in the general documentation.)

> I edited the ZFS Best Practices yesterday to reflect these changes.

I'd say that "In zpool version 19 or greater, it is recommended not to mirror log devices." is not very good advice and should be changed.

/ragge
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 07/04/2010 13:58, Ragnar Sundblad wrote:

> Rather: ... >= 19 would be ... if you don't mind losing data written the ~30 seconds before the crash, you don't have to mirror your log device. For a file server, mail server, etc. etc., where things are stored and supposed to be available later, you almost certainly want redundancy on your slog too. (There may be file servers where this doesn't apply, but they are special cases that should not be mentioned in the general documentation.)

While I agree with you, I want to mention that it is all about understanding a risk. In this case not only does your server have to crash in such a way that data has not been synced (sudden power loss, for example), but there would have to be some data committed to the slog device(s) which was not yet written to the main pool, and when your server restarts your slog device would have to completely die as well. Other than that, you are fine even with an unmirrored slog device.

--
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 7 Apr 2010, Ragnar Sundblad wrote:

>> So the recommendation for zpool < 19 would be: *strongly* recommended. Mirror your log device if you care about using your pool. And the recommendation for zpool >= 19 would be: ... don't mirror your log device. If you have more than one, just add them both unmirrored.
>
> Rather: ... >= 19 would be ... if you don't mind losing data written the ~30 seconds before the crash, you don't have to mirror your log device.

It is also worth pointing out that in normal operation the slog is essentially a write-only device which is only read at boot time. The writes are assumed to work if the device claims success. If the log device fails to read (oops!), then a mirror would be quite useful.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 07/04/2010 15:35, Bob Friesenhahn wrote:

> It is also worth pointing out that in normal operation the slog is essentially a write-only device which is only read at boot time. The writes are assumed to work if the device claims success. If the log device fails to read (oops!), then a mirror would be quite useful.

It is only read at boot if there is uncommitted data on it; during normal reboots zfs won't read data from the slog.

--
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 7 Apr 2010, Robert Milkowski wrote:

> it is only read at boot if there is uncommitted data on it - during normal reboots zfs won't read data from the slog.

How does zfs know if there is uncommitted data on the slog device without reading it? The minimal read would be quite small, but it seems that a read is still required.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 04/07/10 09:19, Bob Friesenhahn wrote:

> How does zfs know if there is uncommitted data on the slog device without reading it? The minimal read would be quite small, but it seems that a read is still required.

If there's ever been synchronous activity then there is an empty tail block (stubby) that will be read, even after a clean shutdown.

Neil.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> From: Ragnar Sundblad [mailto:ra...@csc.kth.se]
>
> Rather: ... >= 19 would be ... if you don't mind losing data written the ~30 seconds before the crash, you don't have to mirror your log device.

If you have a system crash *and* a failed log device at the same time, this is an important consideration. But if you have either a system crash or a failed log device, and they don't happen at the same time, then your sync writes are safe, right up to the nanosecond. Using an unmirrored nonvolatile log device on zpool >= 19.

> I'd say that "In zpool version 19 or greater, it is recommended not to mirror log devices." is not very good advice and should be changed.

See above. Still disagree? If desired, I could clarify the statement, by basically pasting what's written above.
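[Edward's argument is essentially about the joint probability of two failures coinciding. A back-of-the-envelope sketch makes the trade-off concrete. Every number below is an invented assumption for illustration only, not a measurement from the thread; the point Bob raises elsewhere, that a write-only slog can fail undetected, shows up here as the `undetected_days` term.]

```python
# Rough, illustrative joint-failure estimate for an unmirrored slog.
# All rates below are invented assumptions, purely for illustration.

crashes_per_year = 2.0      # unclean shutdowns per year (assumed)
slog_afr = 0.03             # annualized failure rate of the SSD (assumed)
undetected_days = 1.0       # how long a dead slog goes unnoticed (assumed);
                            # writes still "succeed", reads are never exercised

# Probability that the slog happens to be dead (and undetected) at the
# moment a crash occurs, assuming the two failures are independent:
p_slog_dead_at_crash = slog_afr * (undetected_days / 365.25)

# Expected number of "crash while slog is dead" events per year:
expected_loss_events = crashes_per_year * p_slog_dead_at_crash
print("expected data-loss events/year: %.2e" % expected_loss_events)
```

[The result is tiny under these assumptions, which supports Edward's position; but note how directly it scales with `undetected_days`, which is Bob's and Ragnar's point about silent read failures on a write-only device.]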
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
>
> It is also worth pointing out that in normal operation the slog is essentially a write-only device which is only read at boot time. The writes are assumed to work if the device claims success. If the log device fails to read (oops!), then a mirror would be quite useful.

An excellent point. BTW, does the system *ever* read from the log device during normal operation? Such as perhaps during a scrub? It really would be nice to detect, in advance, failure of log devices that are claiming to write correctly but which are really unreadable.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 04/07/10 10:18, Edward Ned Harvey wrote:

> An excellent point. BTW, does the system *ever* read from the log device during normal operation? Such as perhaps during a scrub? It really would be nice to detect failure of log devices in advance, that are claiming to write correctly, but which are really unreadable.

A scrub will read the log blocks, but only for unplayed logs. Because of the transient nature of the log, and because it operates outside of the transaction group model, it's hard to read the in-flight log blocks to validate them. There have previously been suggestions to read slogs periodically. I don't know if there's a CR raised for this though.

Neil.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 7 Apr 2010, Neil Perrin wrote:

> There have previously been suggestions to read slogs periodically. I don't know if there's a CR raised for this though.

Roch wrote up CR 6938883 "Need to exercise read from slog dynamically".

Regards,
markm
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 7 Apr 2010, Edward Ned Harvey wrote:

> If you have a system crash *and* a failed log device at the same time, this is an important consideration. But if you have either a system crash or a failed log device, and they don't happen at the same time, then your sync writes are safe, right up to the nanosecond. Using an unmirrored nonvolatile log device on zpool >= 19.

The point is that the slog is a write-only device, and a device which fails such that it acks each write but fails to read back the data that it wrote could silently fail at any time during the normal operation of the system. It is not necessary for the slog device to fail at the exact same time that the system spontaneously reboots. I don't know if Solaris implements a background scrub of the slog as a normal course of operation, which would cause a device with this sort of failure to be exposed quickly.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 7 Apr 2010, Edward Ned Harvey wrote:

> BTW, does the system *ever* read from the log device during normal operation? Such as perhaps during a scrub? It really would be nice to detect failure of log devices in advance, that are claiming to write correctly, but which are really unreadable.

To make matters worse, an SSD with a large cache might satisfy such reads from its cache, so a scrub of the (possibly) tiny bit of pending synchronous writes may not validate anything. A lightly loaded slog should usually be empty. We already know that some (many?) SSDs are not very good about persisting writes to FLASH, even after acking a cache flush request.

Bob

--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Apr 7, 2010, at 10:19 AM, Bob Friesenhahn wrote:

> The point is that the slog is a write-only device, and a device which fails such that it acks each write but fails to read back the data that it wrote could silently fail at any time during the normal operation of the system. It is not necessary for the slog device to fail at the exact same time that the system spontaneously reboots. I don't know if Solaris implements a background scrub of the slog as a normal course of operation, which would cause a device with this sort of failure to be exposed quickly.

You are playing against marginal returns. An ephemeral storage requirement is very different from a permanent storage requirement. For permanent storage services, scrubs work well: you can have good assurance that if you read the data once, then you will likely be able to read the same data again, with some probability based on the expected decay of the data. For ephemeral data, you do not read the same data more than once, so there is no correlation between reading once and reading again later. In other words, testing the readability of an ephemeral storage service is like a cat chasing its tail. IMHO, this is particularly problematic for contemporary SSDs that implement wear leveling.

[sidebar] For clusters, the same sort of problem exists for path monitoring. If you think about paths (networks, SANs, cups-n-strings), then there is no assurance that a failed transfer means all subsequent transfers will also fail. Some other permanence test is required to predict future transfer failures. s/fail/pass/g [/sidebar]

Bottom line: if you are more paranoid, mirror the separate log devices and sleep through the night. Pleasant dreams! :-)

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>>>>> "jr" == Jeroen Roodhart <j.r.roodh...@uva.nl> writes:

jr> Running OSOL nv130. Power off the machine, removed the F20 and
jr> power back on. Machine boots OK and comes up normally with
jr> the following message in 'zpool status':

yeah, but try it again, and this time put rpool on the F20 as well and try to import the pool from a LiveCD: if you lose zpool.cache at this stage, your pool is toast. /end repeat mode
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 7 apr 2010, at 18.13, Edward Ned Harvey wrote:

> If you have a system crash *and* a failed log device at the same time, this is an important consideration. But if you have either a system crash or a failed log device, and they don't happen at the same time, then your sync writes are safe, right up to the nanosecond. Using an unmirrored nonvolatile log device on zpool >= 19.

Right, but if you have a power or a hardware problem, chances are that more things really break at the same time, including the slog device(s).

> See above. Still disagree? If desired, I could clarify the statement, by basically pasting what's written above.

I believe that for a mail server, NFS server (to be spec compliant), general purpose file server and the like, where the last written data is as important as older data (maybe even more), it would be wise to have at least as good redundancy on the slog as on the data disks. If one can stand the (pretty small) risk of losing the last transaction group before a crash, at the moment typically up to the last 30 seconds of changes, you may have less redundancy on the slog. (And if you don't care at all, like on a web cache perhaps, you could of course disable the zil altogether; that is kind of the other end of the scale, which puts this in perspective.)

As Robert M so wisely and simply put it: it is all about understanding a risk. I think the documentation should help people take educated decisions, though I am not right now sure how to put the words to describe this in an easily understandable way.

/ragge
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Roch,

> Can you try 4 concurrent tars to four different ZFS filesystems (same pool)?

Hmmm, you're on to something here:

http://www.science.uva.nl/~jeroen/zil_compared_e1000_iostat_iops_svc_t_10sec_interval.pdf

In short: when using two exported file systems, total time goes down to around 4 min (IOPS maxes out at around 5500 when adding all four vmods together). When using four file systems, total time goes down to around 3 min 30 s (IOPS maxing out at about 9500).

I figured it is either NFS or a per-file-system data structure in the ZFS/ZIL interface. To rule out NFS, I tried exporting two directories using default NFS shares (via /etc/dfs/dfstab entries). To my surprise this seems to bypass the ZIL altogether (dropping to 100 IOPS, which results from our RAIDZ2 configuration). So clearly ZFS sharenfs is more than a nice front end for NFS configuration :)

But back to your suggestion: you clearly had a hypothesis behind your question. Care to elaborate?

With kind regards,

Jeroen
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> We ran into something similar with these drives in an X4170 that turned out to be an issue of the preconfigured logical volumes on the drives. Once we made sure all of our Sun PCI HBAs were running the exact same version of firmware and recreated the volumes on new drives arriving from Sun, we got back into sync on the X25-E device sizes.

Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive and created a simple volume in the StorageTek RAID utility, the new drive is 0.001 GB smaller than the old drive. I'm still hosed.

Are you saying I might benefit by sticking the SSD into some laptop and zeroing the disk, and then attaching it to the Sun server? Are you saying I might benefit by finding some other way to make the drive available, instead of using the StorageTek RAID utility?

Thanks for the suggestions...

Sorry for the double post. Since the wrong-sized drive was discussed in two separate threads, I want to stick a link here to the other one, where the question was answered, just in case anyone comes across this discussion by search or whatever...

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-April/039669.html
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 4/4/2010 11:04 PM, Edward Ned Harvey wrote:

>> Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs. The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number.
>
> Actually, if there is an fdisk partition and/or disklabel on a drive when it arrives, I'm pretty sure that's irrelevant. Because when I first connect a new drive to the HBA, of course the HBA has to sign and initialize the drive at a lower level than what the OS normally sees. So unless I do some sort of special operation to tell the HBA to preserve/import a foreign disk, the HBA will make the disk blank before the OS sees it anyway.

That may be true. Though these days they may be spec'ing the drives to the manufacturers at an even lower level.

So does your HBA have newer firmware now than it did when the first disk was connected? Maybe it's the HBA that is handling the new disks differently now than it did when the first one was plugged in? Can you down-rev the HBA FW? Do you have another HBA that might still have the older rev you could test it on?

-Kyle
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> From: Kyle McDonald [mailto:kmcdon...@egenera.com]
>
> So does your HBA have newer firmware now than it did when the first disk was connected? Maybe it's the HBA that is handling the new disks differently now than it did when the first one was plugged in? Can you down-rev the HBA FW? Do you have another HBA that might still have the older rev you could test it on?

I'm planning to get the support guys more involved tomorrow; things have been pretty stagnant for several days now, and I think it's time to start putting more effort into this. Long story short, I don't know yet. But there is one glaring clue: prior to OS installation, I don't know how to configure the HBA. This means the HBA must have been preconfigured with the factory-installed disks, and I followed a different process with my new disks, because I was using the GUI within the OS.

My best hope right now is to find some other way to configure the HBA, possibly through the ILOM, but I already searched there and looked at everything. Maybe I have to shut down (power cycle) the system and attach keyboard and monitor. I don't know yet...
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 4 apr 2010, at 06.01, Richard Elling wrote:

Thank you for your reply! Just wanted to make sure.

> Do not assume that power outages are the only cause of unclean shutdowns.
> -- richard

Thanks, I have seen that mistake made several times with other (file)systems, and hope I'll never ever make it myself! :-)

/ragge
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Hmm, when you did the write-back test, was the ZIL SSD included in the write-back? What I was proposing was write-back only on the disks, and ZIL SSD with no write-back.

The tests I did were:

- All disks write-through
- All disks write-back
- With/without SSD for ZIL
- All the permutations of the above

So, unfortunately, no, I didn't test with WriteBack enabled only for spindles and WriteThrough on the SSD. It has been suggested, and this is actually what I now believe based on my experience, that precisely the opposite would be the better configuration: spindles configured WriteThrough while the SSD is configured WriteBack. I believe that would be optimal. If I get the opportunity to test further, I'm interested and I will. But who knows when/if that will happen.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs. The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number.

Actually, if there is an fdisk partition and/or disklabel on a drive when it arrives, I'm pretty sure that's irrelevant. Because when I first connect a new drive to the HBA, of course the HBA has to sign and initialize the drive at a lower level than what the OS normally sees. So unless I do some sort of special operation to tell the HBA to preserve/import a foreign disk, the HBA will make the disk blank before the OS sees it anyway.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes. If the OS does give priority for sync writes going into TXGs before async writes (even with ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.

This is what Jeff Bonwick says in the zil synchronicity ARC case:

"What I mean is that the barrier semantic is implicit even with no ZIL at all. In ZFS, if event A happens before event B, and you lose power, then what you'll see on disk is either nothing, A, or both A and B. Never just B. It is impossible for us not to have at least barrier semantics."

So there's no chance that a *later* async write will overtake an earlier sync *or* async write.

Casper
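[Bonwick's barrier guarantee quoted above, "nothing, A, or both A and B, never just B", can be pictured with a toy model: writes enter the transaction stream in order, and a crash can only cut off a suffix of that stream. This is purely an illustration of the semantics, not ZFS code.]

```python
# Toy model of the implicit barrier semantics: a crash commits only a
# prefix of the issued writes, so a later write can never survive
# without the earlier writes it followed.

def surviving_writes(issued, committed_count):
    """On-disk state after a crash that committed only a prefix."""
    return issued[:committed_count]

issued = ["A", "B"]
possible_states = [surviving_writes(issued, n) for n in range(len(issued) + 1)]
print(possible_states)   # every reachable state is a prefix of the issue order

# 'B' alone is not a reachable on-disk state:
assert ["B"] not in possible_states
```

[The model also shows what is *not* guaranteed: nothing prevents the crash from discarding both A and B, which is exactly the "you may lose the last ~30 seconds" caveat discussed earlier in the thread.]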
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 04/02/10 08:24, Edward Ned Harvey wrote: The purpose of the ZIL is to act like a fast log for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work. Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim I can answer this question, I wrote that code, or at least have read it? I'm one of the ZFS developers. I wrote most of the zil code. Still I don't have all the answers. There's a lot of knowledgeable people on this alias. I usually monitor this alias and sometimes chime in when there's some misinformation being spread, but sometimes the volume is so high. Since I started this reply there's been 20 new posts on this thread alone! Questions to answer would be: Is a ZIL log device used only by sync() and fsync() system calls? - The intent log (separate device(s) or not) is only used by fsync, O_DSYNC, O_SYNC, O_RSYNC. NFS commits are seen to ZFS as fsyncs. Note sync(1m) and sync(2s) do not use the intent log. They force transaction group (txg) commits on all pools. So zfs goes beyond the the requirement for sync() which only requires it schedules but does not necessarily complete the writing before returning. The zfs interpretation is rather expensive but seemed broken so we fixed it. Is it ever used to accelerate async writes? The zil is not used to accelerate async writes. Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? 
Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: the ZIL is disabled. The question is whether the async write could possibly be committed to disk before the sync write. Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If this was a single thread issuing W1 then W2, then yes, the order is guaranteed, regardless of whether W1 or W2 are synchronous or asynchronous. Of course, if the system crashes, then the async operations might not be there. I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct? - Kind of. The uberblock contains the root of the txg. At boot time, or zpool import time, what is taken to be the current filesystem? The latest uberblock? Something else? A txg is for the whole pool, which can contain many filesystems. The latest txg defines the current state of the pool and each individual fs. My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. Correct (except replace sync() with O_DSYNC, etc.). This also assumes hardware that, for example, correctly handles the flushing of its caches. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of an OS halt or ungraceful shutdown, sync writes committed to disk are guaranteed to be equal to or greater than the async writes that were taking place at the same time. 
That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data. The ZIL doesn't make such guarantees. It's the DMU that handles transactions and their grouping into txgs. It ensures that writes are committed in order through its transactional nature. The function of the ZIL is merely to ensure that synchronous operations are stable and replayed after a crash/power fail onto the latest txg. Based on this understanding, if you disable the ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes. No, disabling the ZIL does not disable the DMU. Somebody (Casper?) said it before, and now I'm starting to realize ... This is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either.
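The division of labor Neil describes (the DMU commits whole txgs in order; the ZIL only records synchronous operations so they can be replayed after a crash) can be sketched as a toy model. This is illustrative only, not ZFS internals:

```python
# Toy model (not ZFS code) of DMU vs. ZIL responsibilities: the DMU
# commits whole txgs atomically and in order, while the ZIL records only
# synchronous operations so they can be replayed after a crash.

def recover(ops, committed_txgs, zil_enabled=True):
    """State after a crash: committed txgs, plus ZIL replay of sync ops.

    ops: list of (txg_number, name, is_sync) in issue order.
    committed_txgs: set of txg numbers fully on disk before the crash.
    """
    on_disk = [name for txg, name, _ in ops if txg in committed_txgs]
    if zil_enabled:
        # Sync ops from uncommitted txgs were also logged; replay them
        # on top of the latest committed txg.
        on_disk += [name for txg, name, is_sync in ops
                    if txg not in committed_txgs and is_sync]
    return on_disk

ops = [(1, "async-W1", False), (1, "sync-W2", True)]
# txg 1 never committed: only the sync write survives, via ZIL replay.
assert recover(ops, committed_txgs=set()) == ["sync-W2"]
# With the ZIL disabled, both writes are lost.
assert recover(ops, committed_txgs=set(), zil_enabled=False) == []
```

Note what disabling the ZIL changes in the model: ordering within txgs is untouched (that is the DMU's job), but acknowledged sync writes from the uncommitted txg no longer survive the crash.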
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Al, Have you tried the DDRdrive from Christopher George cgeo...@ddrdrive.com? Looks to me like a much better fit for your application than the F20? It would not hurt to check it out. Looks to me like you need a product with low *latency* - and a RAM-based cache would be a much better performer than any solution based solely on flash. Let us know (on the list) how this works out for you. Well, I did look at it, but at that time there was no Solaris support yet. Right now it seems there is only a beta driver? I kind of remember that if you'd want reliable fallback to NVRAM, you'd need a UPS feeding the card. I could be very wrong there, but the product documentation isn't very clear on this (at least to me ;) ) Also, we'd kind of like to have a SnOracle-supported option. But yeah, on paper it does seem it could be an attractive solution... With kind regards, Jeroen -- This message posted from opensolaris.org
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Well, I did look at it, but at that time there was no Solaris support yet. Right now it seems there is only a beta driver? Correct, we just completed functional validation of the OpenSolaris driver. Our focus has now turned to performance tuning and benchmarking. We expect to formally introduce the DDRdrive X1 to the ZFS community later this quarter. It is our goal to focus exclusively on the dedicated ZIL device market going forward. I kind of remember that if you'd want reliable fallback to NVRAM, you'd need a UPS feeding the card. Currently, a dedicated external UPS is required for correct operation. Based on community feedback, we will be offering automatic backup/restore prior to release. This guarantees the UPS will only be required for 60 seconds to successfully back up the drive contents on a host power or hardware failure. On the next reboot, the restore will dutifully occur before the OS loads, for seamless non-volatile operation. Also, we have heard loud and clear the requests for an internal power option. It is our intention that the X1 will be the first in a family of products all dedicated to ZIL acceleration for not only OpenSolaris but also Solaris 10 and FreeBSD. Also, we'd kind of like to have a SnOracle-supported option. Although a much smaller company, we believe our singular focus and absolute passion for ZFS and the potential of Hybrid Storage Pools will serve our customers well. We are actively designing our soon-to-be-available support plans. Your voice will be heard; please email directly at cgeorge at ddrdrive dot com for requests, comments and/or questions. Thanks, Christopher George Founder/CTO www.ddrdrive.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 1 apr 2010, at 06.15, Stuart Anderson wrote: Assuming you are also using a PCI LSI HBA from Sun that is managed with a utility called /opt/StorMan/arcconf and reports itself as the amazingly informative model number Sun STK RAID INT, what worked for me was to run: arcconf delete (to delete the pre-configured volume shipped on the drive), then arcconf create (to create a new volume). Just to sort things out (or not? :-): I more than agree that this product is highly confusing, but I don't think there is anything LSI in or about that card. I believe it is an Adaptec card, developed, manufactured and supported by Intel for Adaptec, licensed (or something) to StorageTek, and later included in Sun machines (since Sun bought StorageTek, I suppose). Now we could add Oracle to this name-dropping inferno, if we wanted to. I am not sure why they (Sun) put those in there; they don't seem very fast or smart or anything. /ragge
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 2 apr 2010, at 22.47, Neil Perrin wrote: Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: the ZIL is disabled. The question is whether the async write could possibly be committed to disk before the sync write. Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If this was a single thread issuing W1 then W2, then yes, the order is guaranteed, regardless of whether W1 or W2 are synchronous or asynchronous. Of course, if the system crashes, then the async operations might not be there. Could you please clarify this last paragraph a little: Do you mean that this is in the case that you have the ZIL enabled and the txg for W1 and W2 hasn't been committed, so that upon reboot the ZIL is replayed, and therefore only the sync writes are eventually there? If, let's say, W1 is an async small write, W2 is a sync small write, W1 arrives at ZFS before W2, and W2 arrives before the txg is committed, will both writes always be in the txg on disk? If so, it would mean that ZFS itself never buffers up async writes into larger chunks to write in a later txg, correct? I take it that ZIL enabled or not does not make any difference here (we pretend the system did _not_ crash), correct? Thanks! 
/ragge
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Apr 3, 2010, at 5:47 PM, Ragnar Sundblad wrote: On 2 apr 2010, at 22.47, Neil Perrin wrote: Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: the ZIL is disabled. The question is whether the async write could possibly be committed to disk before the sync write. Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If this was a single thread issuing W1 then W2, then yes, the order is guaranteed, regardless of whether W1 or W2 are synchronous or asynchronous. Of course, if the system crashes, then the async operations might not be there. Could you please clarify this last paragraph a little: Do you mean that this is in the case that you have the ZIL enabled and the txg for W1 and W2 hasn't been committed, so that upon reboot the ZIL is replayed, and therefore only the sync writes are eventually there? yes. The ZIL needs to be replayed on import after an unclean shutdown. If, let's say, W1 is an async small write, W2 is a sync small write, W1 arrives at ZFS before W2, and W2 arrives before the txg is committed, will both writes always be in the txg on disk? yes If so, it would mean that ZFS itself never buffers up async writes into larger chunks to write in a later txg, correct? 
Correct. I take it that ZIL enabled or not does not make any difference here (we pretend the system did _not_ crash), correct? For import following a clean shutdown, there are no transactions in the ZIL to apply. For async-only workloads, there are no transactions in the ZIL to apply. Do not assume that power outages are the only cause of unclean shutdowns. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 20:58, Jeroen Roodhart wrote: I'm happy to see that it is now the default, and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Interesting thing is that apparently defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export... Which is to be expected, as it is not an NFS client which requests the behavior but rather an NFS server. Currently on Linux you can export a share as either sync (default) or async, while on Solaris you can't really currently force an NFS server to start working in an async mode. The other part of the issue is that the Solaris clients have been developed with a sync server. The client does more write-behind and continues caching the non-acked data. The Linux client has been developed with an async server and has some catching up to do. Casper
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Robert Milkowski writes: On 01/04/2010 20:58, Jeroen Roodhart wrote: I'm happy to see that it is now the default, and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Interesting thing is that apparently defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export... Which is to be expected, as it is not an NFS client which requests the behavior but rather an NFS server. Currently on Linux you can export a share as either sync (default) or async, while on Solaris you can't really currently force an NFS server to start working in an async mode. True, and there is an entrenched misconception (not you) that this is a ZFS-specific problem, which it's not. It's really an NFS protocol feature which can be circumvented using zil_disable, which therefore reinforces the misconception. It's further reinforced by testing an NFS server on disk drives with WCE=1 with a filesystem other than ZFS. All fast options cause the NFS client to become inconsistent after a server reboot. Whatever was being done in the moments prior to server reboot will need to be wiped out by users if they are told that the server did reboot. That's manageable for home use, not for the enterprise. -r -- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Seriously: all disks configured WriteThrough (spindle and SSD disks alike), using the dedicated ZIL SSD device, is very noticeably faster than enabling the WriteBack. What do you get with both SSD ZIL and WriteBack disks enabled? I mean, if you have both, why not use both? Then both async and sync IO benefit. Interesting, but unfortunately false. Soon I'll post the results here. I just need to package them in a way suitable for the public, and stick it on a website. But I'm fighting IT fires for now and haven't had the time yet. Roughly speaking, the following are approximately representative. Of course it varies based on tweaks of the benchmark and stuff like that.

Stripe of 3 mirrors, WriteThrough: 450-780 IOPS
Stripe of 3 mirrors, WriteBack: 1030-2130 IOPS
Stripe of 3 mirrors, WriteBack + SSD ZIL: 1220-2480 IOPS
Stripe of 3 mirrors, WriteThrough + SSD ZIL: 1840-2490 IOPS

Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4 times faster than naked disks. And for some reason, having the WriteBack enabled while you have an SSD ZIL actually hurts performance by approximately 10%. You're better off using the SSD ZIL with disks in WriteThrough mode. That result is surprising to me, but I have a theory to explain it. When you have WriteBack enabled, the OS issues a small write, and the HBA immediately returns to the OS: Yes, it's on nonvolatile storage. So the OS quickly gives it another, and another, until the HBA write cache is full. Now the HBA faces the task of writing all those tiny writes to disk, and the HBA must simply follow orders, writing a tiny chunk to the sector it said it would write, and so on. The HBA cannot effectively consolidate the small writes into a larger sequential block write. But if you have the WriteBack disabled, and you have an SSD for the ZIL, then ZFS can log the tiny operation on the SSD and immediately return to the process: Yes, it's on nonvolatile storage. So the application can issue another, and another, and another. 
ZFS is smart enough to aggregate all these tiny write operations into a single larger sequential write before sending it to the spindle disks. Long story short, the evidence suggests that if you have an SSD ZIL, you're better off without WriteBack on the HBA. And I conjecture the reason is that ZFS can buffer writes better than the HBA can.
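The workload being described, many tiny writes each forced to stable storage before the next is issued, can be sketched in Python, whose os.write/os.fsync are thin wrappers over the POSIX calls this thread discusses. The path and record contents are illustrative:

```python
import os
import tempfile

# The small-sync-write pattern that stresses the ZIL: each record is
# forced to stable storage (fsync(2) via os.fsync) before the next one
# is issued. A fast log device absorbs exactly this per-record latency.

def tiny_sync_writes(path, records):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for rec in records:
            os.write(fd, rec)
            os.fsync(fd)  # blocks until this record is on stable storage
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "journal")
    tiny_sync_writes(path, [b"rec1\n", b"rec2\n", b"rec3\n"])
    with open(path, "rb") as f:
        assert f.read() == b"rec1\nrec2\nrec3\n"
```

On a pool with a slog, each os.fsync here can be acknowledged from the log device, and the backing store still receives the records later as one aggregated sequential write, which is the behavior the post above is describing.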
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
I know it is way after the fact, but I find it best to coerce each drive down to a whole-GB boundary using format (create a Solaris partition just up to the boundary). Then if you ever get a drive a little smaller, it should still fit. It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion. If my new replacement SSD with identical part number and firmware is 0.001 GB smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what. I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later, if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
When we use one vmod, both machines are finished in about 6min45; zilstat maxes out at about 4200 IOPS. Using four vmods it takes about 6min55; zilstat maxes out at 2200 IOPS. Can you try 4 concurrent tars to four different ZFS filesystems (same pool)? -r
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
http://nfs.sourceforge.net/ I think B4 is the answer to Casper's question: We were talking about ZFS, and under what circumstances data is flushed to disk, in what way sync and async writes are handled by the OS, and what happens if you disable the ZIL and lose power to your system. We were talking about C/C++ sync and async, not NFS sync and async. I don't think anything relating to NFS is the answer to Casper's question, or else Casper was simply jumping context by asking it. Don't get me wrong, I have no objection to his question or anything; it's just that the conversation has derailed and now people are talking about NFS sync/async instead of what happens when a C/C++ application is doing sync/async writes to a disabled ZIL.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate it into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file. ZFS writes data in transaction groups, and each bunch of data which gets written is bounded by a transaction group. The current state of the data at the time the TXG starts will be the state of the data once the TXG completes. If the system spontaneously reboots, then it will restart at the last completed TXG, so any residual writes which might have occurred while a TXG write was in progress will be discarded. Based on this, I think that your ordering concerns (sync writes getting to disk faster than async writes) are unfounded for normal file I/O. So you're saying that while the OS is building txgs to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txgs. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way? If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system? The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. It's a blocking call that doesn't return until the sync is completed. The only reason you would ever do this is if order matters. If you cannot allow the next command to begin until after the previous one was completed. Such is the situation with databases and sometimes virtual machines.
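For concreteness, the pattern under discussion, a small synchronous write followed by a larger asynchronous one, looks like this using Python's thin wrappers over the same POSIX calls: fsync(2) blocks until the data is on stable storage, while the later plain write may sit in the page cache until the OS flushes it. The file name and record sizes are illustrative:

```python
import os
import tempfile

# Small sync write, then a larger async write. os.fsync wraps fsync(2):
# it does not return until the first record is on stable storage. The
# second write carries no such guarantee until the next flush/txg.

def sync_then_async(path):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"small-sync-record\n")
        os.fsync(fd)               # durable before we move on
        os.write(fd, b"L" * 4096)  # async: may still be in page cache
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "db")
    sync_then_async(path)
    assert os.path.getsize(path) == len(b"small-sync-record\n") + 4096
```

After a crash between the two writes, the debate above is about exactly which of these two records can appear on disk, and in what combinations.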
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hello, I have had this problem this week. Our ZIL SSD died (APT SLC SSD, 16 GB). Because we had no spare drive in stock, we ignored it. Then we decided to update our Nexenta 3 alpha to beta, exported the pool, made a fresh install to have a clean system, and tried to import the pool. We only got an error message about a missing drive. We googled about this, and it seems there is no way to access the pool!!! (Hope this will be fixed in the future.) We had a backup and the data are not so important, but that could be a real problem: you have a valid ZFS pool and you cannot access your data due to a missing ZIL. If you have a zpool less than version 19 (when the ability to remove log devices was introduced) and you have a non-mirrored log device that failed, you had better treat the situation as an emergency. Normally you can find your current zpool version by doing zpool upgrade, but you cannot do so now if you're in this failure state. Do not attempt zfs send or zfs list or any other zpool or zfs command. Instead, do man zpool and look for zpool remove. If it says it supports removing log devices, then you had better use it to remove your log device. If it says it only supports removing hotspares or cache, then your zpool is lost permanently. If you are running Solaris, take it as given: you do not have zpool version 19. If you are running OpenSolaris, I don't know at which point zpool version 19 was introduced. Your only hope is to zpool remove the log device. Use tar or cp or something to try and salvage your data out of there. Your zpool is lost, and if it's functional at all right now, it won't stay that way for long. Your system will soon hang, and then you will not be able to import your pool. Ask me how I know.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
ZFS recovers to a crash-consistent state, even without the slog, meaning it recovers to some state through which the filesystem passed in the seconds leading up to the crash. This isn't what UFS or XFS do. The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state (…) You're speaking the opposite of common sense. If disabling the ZIL makes the system faster *and* less prone to data corruption, please explain why we don't all disable the ZIL?
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
If you have zpool less than version 19 (when the ability to remove log devices was introduced) and you have a non-mirrored log device that failed, you had better treat the situation as an emergency. Instead, do man zpool and look for zpool remove. If it says it supports removing log devices, then you had better use it to remove your log device. If it says it only supports removing hotspares or cache, then your zpool is lost permanently. I take it back. If you lost your log device on a zpool which is less than version 19, then you *might* have a possible hope if you migrate your disks to a later system. You *might* be able to zpool import on a later version of the OS.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
http://nfs.sourceforge.net/ I think B4 is the answer to Casper's question: We were talking about ZFS, and under what circumstances data is flushed to disk, in what way sync and async writes are handled by the OS, and what happens if you disable ZIL and lose power to your system. We were talking about C/C++ sync and async. Not NFS sync and async. I don't think so. http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg36783.html (This discussion was started, I think, in the context of NFS performance) Casper
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
So you're saying that while the OS is building txgs to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txgs. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way? The question is not how the writes are ordered but whether an earlier write can be in a later txg. A transaction group is committed atomically. In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I asked a similar question to make sure I understood it correctly, and the answer was: Casper, the answer is from Neil Perrin: Is there a partial order defined for all filesystem operations? File system operations will be written in order for all settings of the sync flag. Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a file, (I assume by O_DATA you meant O_DSYNC) that later transactions will not be in an earlier transaction group? (Or is this already the case?) This is already the case. So what I assumed was true, but what you made me doubt, was apparently still true: later transactions cannot be committed in an earlier txg. If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system? For an application running on the file server, there is no difference. When the system panics you know that data might be lost. The application also dies. (The snapshot and the last valid uberblock are equally valid.) But for an application on an NFS client, without the ZIL data will be lost while the NFS client believes the data is written, and it will not try again. With the ZIL, when the NFS server says that data is written, then it is actually on stable storage. The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. 
It's a blocking call that doesn't return until the sync is completed. The only reason you would ever do this is if order matters. If you cannot allow the next command to begin until after the previous one was completed. Such is the situation with databases and sometimes virtual machines. So the question is: when will your data be invalid? What happens with the data when the system dies before the fsync() call? What happens with the data when the system dies after the fsync() call? What happens with the data when the system dies after more I/O operations? With the ZIL disabled, you call fsync() but you may encounter data from before the call to fsync(). That could happen before, so I assume you can actually recover from that situation. Casper
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Dude, don't be so arrogant. Acting like you know what I'm talking about better than I do. Face it that you have something to learn here. You may say that, but then you post this: Acknowledged. I read something arrogant, and I replied even more arrogantly. That was dumb of me.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Only a broken application uses sync writes sometimes, and async writes at other times. Suppose there is a virtual machine, with virtual processes inside it. Some virtual process issues a sync write to the virtual OS, while another virtual process issues an async write. Then the virtual OS will sometimes issue sync writes and sometimes async writes to the host OS. Are you saying this makes qemu, and vbox, and vmware broken applications?
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
The purpose of the ZIL is to act like a fast log for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work. Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code," or at least "I have read it"? Questions to answer would be: Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes? Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: the ZIL is disabled. The question is whether the async write could possibly be committed to disk before the sync write. I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct? At boot time, or zpool import time, what is taken to be the current filesystem? The latest uberblock? Something else? My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. 
(2) In the event of an OS halt or ungraceful shutdown, sync writes committed to disk are guaranteed to be equal to or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data. Based on this understanding, if you disable the ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes. Somebody (Casper?) said it before, and now I'm starting to realize ... This is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything. The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes. If the OS does give priority to sync writes going into TXGs before async writes (even with the ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Questions to answer would be: Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes?

There are quite a few sync writes, specifically when you mix in the NFS server.

> Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

From what I quoted in the other discussion, it seems that later writes cannot be committed in an earlier TXG than your sync write or other earlier writes.

> I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct?

The uberblock is the root of all the data. All the data in a ZFS pool is referenced by it; after the txg is in stable storage, the uberblock is updated.

> At boot time, or zpool import time, what is taken to be the current filesystem? The latest uberblock? Something else?

The current zpool and the filesystems as referenced by the last uberblock.

> My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes.
> (2) In the event of OS halting or ungraceful shutdown, sync writes committed to disk are guaranteed to be equal to or ahead of the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.

sync() is actually *async*, and returning from sync() says nothing about stable storage. After fsync() returns, it signals that all the data is in stable storage (except if you disable the ZIL), or, apparently, in Linux when the write caches for your disks are enabled (the default for PC drives). ZFS doesn't care about the write cache; it makes sure it is flushed. (There's fsync() and open(..., O_DSYNC|O_SYNC).)

> Based on this understanding, if you disable ZIL, then there is no guarantee about order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes. Somebody (Casper?) said it before, and now I'm starting to realize ... This is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything. The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes. If the OS does give priority for sync writes going into TXG's before async writes (even with ZIL disabled), then after spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.

I believe that the writes are still ordered, so the consistency you want is actually delivered even without the ZIL enabled.
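The sync()-versus-fsync() distinction drawn here can be shown with the POSIX calls themselves (a hedged sketch; durability ultimately depends on the filesystem and drive caches, exactly as noted above):

```python
import os
import tempfile

# fsync(fd) blocks until *this file's* data is in stable storage
# (modulo the disabled-ZIL / drive-write-cache caveats discussed above).
fd, path = tempfile.mkstemp()
os.write(fd, b"durable record\n")
os.fsync(fd)
os.close(fd)

# sync() by contrast only schedules a system-wide flush; traditionally it
# may return before the data is actually on disk (though, as noted later
# in the thread, ZFS makes sync() synchronous).
os.sync()
```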
Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:

> > I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.
>
> It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion. If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.

Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs. The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number. This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same reasons. I'm a little surprised that the engineers would suddenly stop doing it only on SSDs. But who knows.

-Kyle

> I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 16:24, Edward Ned Harvey solar...@nedharvey.com wrote:

> The purpose of the ZIL is to act like a fast log for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work. Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Who can claim "I can answer this question; I wrote that code, or at least have read it"? Questions to answer would be: Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes?

sync() will tell the filesystems to flush writes to disk. sync() will not use the ZIL; it will just start a new TXG, and could return before the writes are done. fsync() is what you are interested in.

> Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

Writes from a TXG will not be used until the whole TXG is committed to disk. Everything from a half-written TXG will be ignored after a crash. This means that the order of writes within a TXG is not important. The only way to do a sync write without the ZIL is to start a new TXG after the write. That costs a lot, so we have the ZIL for sync writes.
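The all-or-nothing TXG behavior described in this reply can be modeled with a toy sketch (purely illustrative, not ZFS's actual data structures or names):

```python
# Toy model of transaction-group commit: writes accumulate in an open
# group; on commit the whole group becomes visible atomically, and a
# half-written group is discarded at import time after a crash.

class ToyPool:
    def __init__(self):
        self.committed = []   # state reachable from the latest "uberblock"
        self.open_txg = []    # writes not yet on stable storage

    def write(self, record):
        self.open_txg.append(record)

    def commit_txg(self):
        # all-or-nothing: the whole group lands atomically
        self.committed.extend(self.open_txg)
        self.open_txg = []

    def crash(self):
        # a partially written TXG is ignored on recovery
        self.open_txg = []

pool = ToyPool()
pool.write("sync: small record")
pool.write("async: large block")
pool.crash()                       # neither write survives...
assert pool.committed == []

pool.write("sync: small record")
pool.commit_txg()                  # ...but a committed group survives whole
pool.write("async: large block")
pool.crash()
assert pool.committed == ["sync: small record"]
```

This is why the ordering of writes *within* a group is irrelevant: after a crash, either the whole group is there or none of it is.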
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:

> So you're saying that while the OS is building txg's to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txg's. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way?

I am like a pool or tank of regurgitated zfs knowledge. I simply pay attention when someone who really knows explains something (e.g. Neil Perrin, as Casper referred to) so I can regurgitate it later. I try to do so faithfully. If I had behaved this way in school, I would have been a good student. Sometimes I am wrong, or the design has somewhat changed since the original information was provided.

There are indeed popular filesystems (e.g. Linux EXT4) which write data to disk in a different order than chronologically requested, so it is good that you are paying attention to these issues. While in the slog-based recovery scenario it is possible for a TXG to be generated which lacks async data, this only happens after a system crash, and if all of the critical data is written as a sync request, it will be faithfully preserved.

Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:

> were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data. Based on this understanding, if you disable ZIL, then there is no guarantee about order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.

You seem to be assuming that Solaris is an incoherent operating system. With ZFS, the filesystem in memory is coherent, and transaction groups are constructed in simple chronological order (capturing combined changes up to that point in time), without regard to SYNC options. The only possible exception to the coherency is for memory-mapped files, where the mapped memory is a copy of data (originally) from the ZFS ARC and needs to be reconciled with the ARC if an application has dirtied it. This differs from UFS and the way Solaris worked prior to Solaris 10.

Synchronous writes are not faster than asynchronous writes. If you drop heavy and light objects from the same height, they fall at the same rate. This was proven long ago.

Bob
-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Apr 2, 2010, at 5:08 AM, Edward Ned Harvey wrote:

> > I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.
>
> It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion. If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what. I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later if I get a replacement device for the mirror, that's slightly smaller than the others, I have no reason to care.

However, I believe there are some downsides to letting ZFS manage just a slice rather than an entire drive, but perhaps those do not apply as significantly to SSD devices?

Thanks
-- Stuart Anderson ander...@ligo.caltech.edu http://www.ligo.caltech.edu/~anderson
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey solar...@nedharvey.com wrote:

> > > Seriously, all disks configured WriteThrough (spindle and SSD disks alike) using the dedicated ZIL SSD device, very noticeably faster than enabling the WriteBack.
> >
> > What do you get with both SSD ZIL and WriteBack disks enabled? I mean if you have both why not use both? Then both async and sync IO benefits.
>
> Interesting, but unfortunately false. Soon I'll post the results here. I just need to package them in a way suitable to give the public, and stick it on a website. But I'm fighting IT fires for now and haven't had the time yet. Roughly speaking, the following are approximately representative. Of course it varies based on tweaks of the benchmark and stuff like that.
>
> Stripe of 3 mirrors, WriteThrough: 450-780 IOPS
> Stripe of 3 mirrors, WriteBack: 1030-2130 IOPS
> Stripe of 3 mirrors, WriteBack + SSD ZIL: 1220-2480 IOPS
> Stripe of 3 mirrors, WriteThrough + SSD ZIL: 1840-2490 IOPS
>
> Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4 times faster than naked disks. And for some reason, having WriteBack enabled while you have the SSD ZIL actually hurts performance by approx 10%. You're better off using the SSD ZIL with disks in WriteThrough mode.
>
> That result is surprising to me, but I have a theory to explain it. When you have WriteBack enabled, the OS issues a small write, and the HBA immediately returns to the OS: "Yes, it's on nonvolatile storage." So the OS quickly gives it another, and another, until the HBA write cache is full. Now the HBA faces the task of writing all those tiny writes to disk, and the HBA must simply follow orders, writing a tiny chunk to the sector it said it would write, and so on. The HBA cannot effectively consolidate the small writes into a larger sequential block write.
> But if you have WriteBack disabled, and you have an SSD for the ZIL, then ZFS can log the tiny operation on SSD and immediately return to the process: "Yes, it's on nonvolatile storage." So the application can issue another, and another, and another. ZFS is smart enough to aggregate all these tiny write operations into a single larger sequential write before sending it to the spindle disks.

Hmm, when you did the write-back test, was the ZIL SSD included in the write-back? What I was proposing was write-back only on the disks, and the ZIL SSD with no write-back. Not all operations hit the ZIL, so it would still be nice to have the non-ZIL operations return quickly.

-Ross
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 02/04/2010 16:04, casper@sun.com wrote:

> sync() is actually *async* and returning from sync() says nothing about

To clarify: in the case of ZFS, sync() is actually synchronous.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors?

There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):

LBA count = 97696368 + (1953504 * (Desired Capacity in GBytes - 50.0))

Sizes should match exactly if the manufacturer follows the standard. See: http://opensolaris.org/jive/message.jspa?messageID=393336#393336 http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=downloaddata_file_id=1066 -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
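For concreteness, the quoted IDEMA LBA1-02 formula (assuming 512-byte sectors) works out as follows; the 1 TB drive is just a worked example:

```python
def idema_lba_count(capacity_gb):
    """LBA count for a drive of `capacity_gb` decimal gigabytes, per the
    IDEMA LBA1-02 formula quoted above (512-byte sectors)."""
    return int(97696368 + 1953504 * (capacity_gb - 50.0))

# The 50 GB baseline of the formula:
assert idema_lba_count(50) == 97696368

# A nominal 1 TB (1000 GB) drive:
assert idema_lba_count(1000) == 1953525168   # * 512 = 1,000,204,886,016 bytes
```

Two drives from different vendors that both follow the standard should therefore report exactly the same sector count for the same nominal capacity.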
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
enh == Edward Ned Harvey solar...@nedharvey.com writes: enh If you have zpool less than version 19 (when ability to remove enh log device was introduced) and you have a non-mirrored log enh device that failed, you had better treat the situation as an enh emergency. Ed the log device removal support is only good for adding a slog to try it out, then changing your mind and removing the slog (which was not possible before). It doesn't change the reliability situation one bit: pools with dead slogs are not importable. There've been threads on this for a while. It's well-discussed because it's an example of IMHO broken process of ``obviously a critical requirement but not technically part of the original RFE which is already late,'' as well as a dangerous pitfall for ZFS admins. I imagine the process works well in other cases to keep stuff granular enough that it can be prioritized effectively, but in this case it's made the slog feature significantly incomplete for a couple years and put many production systems in a precarious spot, and the whole mess was predicted before the slog feature was integrated. The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state enh You're speaking the opposite of common sense. Yeah, I'm doing it on purpose to suggest that just guessing how you feel things ought to work based on vague notions of economy isn't a good idea. enh If disabling the ZIL makes the system faster *and* less prone enh to data corruption, please explain why we don't all disable enh the ZIL? I said complying with fsync can make the system recover to a state not equal to one you might have hypothetically snapshotted in a moment leading up to the crash. Elsewhere I might've said disabling the ZIL does not make the system more prone to data corruption, *iff* you are not an NFS server. 
If you are, disabling the ZIL can lead to lost writes if an NFS server reboots and an NFS client does not, which can definitely cause app-level data corruption. Disabling the ZIL breaks the D requirement of ACID databases which might screw up apps that replicate, or keep databases on several separate servers in sync, and it might lead to lost mail on an MTA, but because unlike non-COW filesystems it costs nothing extra for ZFS to preserve write ordering even without fsync(), AIUI you will not get corrupted application-level data by disabling the ZIL. you just get missing data that the app has a right to expect should be there. The dire warnings written by kernel developers in the wikis of ``don't EVER disable the ZIL'' are totally ridiculous and inappropriate IMO. I think they probably just worked really hard to write the ZIL piece of ZFS, and don't want people telling their brilliant code to fuckoff just because it makes things a little slower. so we get all this ``enterprise'' snobbery and so on. ``crash consistent'' is a technical term not a common-sense term, and I may have used it incorrectly: http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html With a system that loses power on which fsync() had been in use, the files getting fsync()'ed will probably recover to more recent versions than the rest of the files, which means the recovered state achieved by yanking the cord couldn't have been emulated by cloning a snapshot and not actually having lost power. However, the app calling fsync() will expect this, so it's not supposed to lead to application-level inconsistency. 
If you test your app's recovery ability in just that way, by cloning snapshots of filesystems on which the app is actively writing and then seeing if the app can recover the clone, then you're unfortunately not testing the app quite hard enough if fsync() is involved, so yeah I guess disabling the ZIL might in theory make incorrectly-written apps less prone to data corruption. Likewise, no testing of the app on a ZFS will be aggressive enough to make the app powerfail-proof on a non-COW POSIX system because ZFS keeps more ordering than the API actually guarantees to the app. I'm repeating myself though. I wish you'll just read my posts with at least paragraph granularity instead of just picking out individual sentences and discarding everything that seems too complicated or too awkwardly stated. I'm basing this all on the ``common sense'' that to do otherwise, fsync() would have to completely ignore its filedescriptor argument. It'd have to copy the entire in-memory ZIL to the slog and behave the same as 'lockfs -fa', which I think would perform too badly compared to non-ZFS filesystems' fsync()s, and would lead to emphatic performance advice like ``segregate files that get lots of fsync()s into separate ZFS datasets from files that get high write bandwidth,'' and we don't have advice like that in the blogs/lists/wikis which makes me think it's not beneficial (the benefit would be
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2, 2010 at 10:08 AM, Kyle McDonald kmcdon...@egenera.com wrote:

> On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
> > > I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.
> >
> > It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion. If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.
>
> Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs. The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number. This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same reasons. I'm a little surprised that the engineers would suddenly stop doing it only on SSDs. But who knows.
>
> -Kyle

If I were forced to ignorantly cast a stone, it would be into Intel's lap (if the SSDs indeed came directly from Sun). Sun's normal drive vendors have been in this game for decades and know the expectations. Intel, on the other hand, may not have quite the same QC in place yet.

--Tim
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Fri, Apr 2 at 11:14, Tirso Alonso wrote:

> > If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors?
>
> There is a standard for sizes that many manufacturers use (IDEMA LBA1-02): LBA count = (97696368) + (1953504 * (Desired Capacity in GBytes - 50.0)) Sizes should match exactly if the manufacturer follows the standard. See: http://opensolaris.org/jive/message.jspa?messageID=393336#393336 http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=downloaddata_file_id=1066

Problem is that it only applies to devices that are >= 50GB in size, and the X25 in question is only 32GB. That being said, I'd be skeptical of either the sourcing of the parts, or else some other configuration feature on the drives (like HPA or DCO) that is changing the capacity. It's possible one of these is in effect.

--eric
-- Eric D. Mudama edmud...@mail.bounceswoosh.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Jeroen,

Have you tried the DDRdrive from Christopher George cgeo...@ddrdrive.com? It looks to me like a much better fit for your application than the F20. It would not hurt to check it out. It looks to me like you need a product with low *latency*, and a RAM-based cache would be a much better performer than any solution based solely on flash.

Let us know (on the list) how this works out for you.

Regards,
-- Al Hopper, Logical Approach Inc, Plano, TX a...@logical-approach.com Voice: 214.233.5089 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is to have an ungraceful power down or reboot. The advice I would give is: Do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot ... and roll back once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption.

Why do you need the rollback? The current filesystems have correct and consistent data, no different from the last two snapshots. (Snapshots can happen in the middle of untarring.) The difference between running with or without the ZIL is whether the client has lost data when the server reboots; no different from using Linux as an NFS server.

Casper
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is to have an ungraceful power down or reboot. The advice I would give is: Do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot ... and roll back once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption.
>
> Why do you need the rollback? The current filesystems have correct and consistent data, no different from the last two snapshots. (Snapshots can happen in the middle of untarring.) The difference between running with or without the ZIL is whether the client has lost data when the server reboots; no different from using Linux as an NFS server.

If you have an ungraceful shutdown in the middle of writing stuff while the ZIL is disabled, then you have corrupt data. Could be files that are partially written. Could be wrong permissions or attributes on files. Could be missing files or directories. Or some other problem. Some changes from the last 1 second of operation before the crash might be written, while some changes from the last 4 seconds might still be unwritten. This is data corruption, which could be worse than losing a few minutes of changes. At least if you roll back, you know the data is consistent, and you know what you lost. You won't continue having more losses afterward caused by inconsistent data on disk.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive and created a simple volume in the StorageTek RAID utility, the new drive is 0.001 Gb smaller than the old drive. I'm still hosed. Are you saying I might benefit by sticking the SSD into some laptop and zeroing the disk, and then attaching it to the Sun server? Are you saying I might benefit by finding some other way to make the drive available, instead of using the StorageTek RAID utility?
>
> Assuming you are also using a PCI LSI HBA from Sun that is managed with a utility called /opt/StorMan/arcconf and reports itself as the amazingly informative model number "Sun STK RAID INT", what worked for me was to run: arcconf delete (to delete the pre-configured volume shipped on the drive), then arcconf create (to create a new volume). What I observed was that arcconf getconfig 1 would show the same physical device size for our existing drives and the new ones from Sun, but they reported a slightly different logical volume size. I am fairly sure that was due to the Sun factory creating the initial volume with a different version of the HBA controller firmware than we were using to create our own volumes. If I remember the sign correctly, the newer firmware creates larger logical volumes, and you really want to upgrade the firmware if you are going to be running multiple X25-E drives from the same controller. I hope that helps.

Uggh. This is totally different from my system. But thanks for writing. I'll take this knowledge and see if we can find some analogous situation with the StorageTek controller. It still may be helpful, so again, thanks.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> If you have an ungraceful shutdown in the middle of writing stuff, while the ZIL is disabled, then you have corrupt data. Could be files that are partially written. Could be wrong permissions or attributes on files. Could be missing files or directories. Or some other problem. Some changes from the last 1 second of operation before crash might be written, while some changes from the last 4 seconds might be still unwritten. This is data corruption, which could be worse than losing a few minutes of changes. At least, if you rollback, you know the data is consistent, and you know what you lost. You won't continue having more losses afterward caused by inconsistent data on disk.

How exactly is this different from rolling back to some other point in time? I think you don't quite understand how ZFS works: all operations are grouped in transaction groups, and all the transactions in a particular group are committed in one operation. I don't know what partial ordering ZFS uses when creating transaction groups, but a snapshot just picks one transaction group as the last group included in the snapshot.

When the system reboots, ZFS picks the most recent valid uberblock, so the data available is correct up to transaction group N1. If you roll back to a snapshot, you get data correct up to transaction group N2. But N2 < N1, so you lose more data. Why do you think that a snapshot has better quality than the last snapshot available?

Casper
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> > If you have an ungraceful shutdown in the middle of writing stuff, while the ZIL is disabled, then you have corrupt data. Could be files that are partially written. Could be wrong permissions or attributes on files. Could be missing files or directories. Or some other problem. Some changes from the last 1 second of operation before crash might be written, while some changes from the last 4 seconds might be still unwritten. This is data corruption, which could be worse than losing a few minutes of changes. At least, if you rollback, you know the data is consistent, and you know what you lost. You won't continue having more losses afterward caused by inconsistent data on disk.
>
> How exactly is this different from rolling back to some other point in time? I think you don't quite understand how ZFS works; all operations are grouped in transaction groups; all the transactions in a particular group are committed in one operation. I don't know what partial ordering ZFS

Dude, don't be so arrogant, acting like you know what I'm talking about better than I do. Face it: you have something to learn here.

Yes, all the transactions in a transaction group are either committed entirely to disk or not at all. But they're not necessarily committed to disk in the same order that the user-level applications requested. Meaning: suppose I have an application that writes to disk in sync mode intentionally ... perhaps because my internal file format consistency would be corrupted if I wrote out of order. If the sysadmin has disabled the ZIL, my sync write will not block, and I will happily issue more write operations. As long as the OS remains operational, no problem. The OS keeps the filesystem consistent in RAM and correctly manages all the open file handles. But if the OS dies for some reason, some of my later writes may have been committed to disk while some of my earlier writes could be lost, because they were still being buffered in system RAM for a later transaction group.
This is particularly likely to happen, if my application issues a very small sync write, followed by a larger async write, followed by a very small sync write, and so on. Then the OS will buffer my small sync writes and attempt to aggregate them into a larger sequential block for the sake of accelerated performance. The end result is: My larger async writes are sometimes committed to disk before my small sync writes. But the only reason I would ever know or care about that would be if the ZIL were disabled, and the OS crashed. Afterward, my file has internal inconsistency. Perfect examples of applications behaving this way would be databases and virtual machines. Why do you think that a Snapshot has a better quality than the last snapshot available? If you rollback to a snapshot from several minutes ago, you can rest assured all the transaction groups that belonged to that snapshot have been committed. So although you're losing the most recent few minutes of data, you can rest assured you haven't got file corruption in any of the existing files. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
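The hazard described above can be sketched as a toy model. This is not ZFS code; all class and method names are invented for illustration. It contrasts an OS that honors fsync() with one where sync writes are silently deferred (the disabled-ZIL case), then crashes after flushing only the large aggregated write.

```python
# Toy model (NOT ZFS) of the ordering hazard: a small sync write
# followed by a larger async write, with a crash before all buffers
# are flushed.

class BufferedDisk:
    def __init__(self, honor_fsync=True):
        self.honor_fsync = honor_fsync
        self.buffer = []   # (offset, data) handed to the OS, not yet durable
        self.disk = {}     # durable state

    def write(self, offset, data, sync=False):
        self.buffer.append((offset, data))
        if sync and self.honor_fsync:
            self.flush()   # application blocks until the data is durable

    def flush(self):
        for off, data in self.buffer:
            self.disk[off] = data
        self.buffer = []

    def crash_after_flushing_largest(self):
        # Simulate the OS aggregating and flushing the biggest buffered
        # write first, then dying before writing the rest.
        if self.buffer:
            big = max(self.buffer, key=lambda w: len(w[1]))
            self.disk[big[0]] = big[1]
        self.buffer = []

safe = BufferedDisk(honor_fsync=True)
safe.write(0, "hdr", sync=True)    # small sync write: durable immediately
safe.write(1, "payload" * 10)      # larger async write
safe.crash_after_flushing_largest()

unsafe = BufferedDisk(honor_fsync=False)
unsafe.write(0, "hdr", sync=True)  # sync write silently deferred
unsafe.write(1, "payload" * 10)
unsafe.crash_after_flushing_largest()

print(0 in safe.disk, 0 in unsafe.disk)  # True False
```

With fsync honored, the small header write is guaranteed on disk before the payload; with it ignored, only the async payload survived the crash, leaving the file internally inconsistent.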
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
This approach does not solve the problem. When you do a snapshot, the txg is committed. If you wish to reduce the exposure to loss of sync data and run with ZIL disabled, then you can change the txg commit interval -- however changing the txg commit interval will not eliminate the possibility of data loss. The default commit interval is what, 30 seconds? Doesn't that guarantee that any snapshot taken more than 30 seconds ago will have been fully committed to disk? Therefore, any snapshot older than 30 seconds old is guaranteed to be consistent on disk. While anything less than 30 seconds old could possibly have some later writes committed to disk before some older writes from a few seconds before. If I'm wrong about this, please explain. I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file. If you rollback to a snapshot that's at least 30 seconds old, then all the writes for that snapshot are guaranteed to be committed to disk already, and in the right order. You're acknowledging the loss of some known time worth of data. But you're gaining a guarantee of internal file consistency. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
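At the POSIX level the distinction looks like this. A minimal sketch, assuming a Unix-like system; the file path is made up for the example:

```python
# Sync vs async writes in POSIX terms.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data")

# Sync write: the application blocks until the OS acks that the data
# has been committed to stable storage.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"critical record\n")
os.fsync(fd)          # blocks here until the write is durable
os.close(fd)

# Async write: hand the data to the OS and continue immediately; the
# OS flushes its page cache at its own discretion.
with open(path, "ab") as f:
    f.write(b"bulk data\n")   # no fsync: may sit in RAM for a while

with open(path, "rb") as f:
    print(f.read())   # b'critical record\nbulk data\n'
```

The async path returns sooner, so the application is free to issue more writes; the sync path is for applications whose internal consistency depends on knowing the previous write is on disk before proceeding.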
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Dude, don't be so arrogant. Acting like you know what I'm talking about better than I do. Face it that you have something to learn here. You may say that, but then you post this: Why do you think that a Snapshot has a better quality than the last snapshot available? If you rollback to a snapshot from several minutes ago, you can rest assured all the transaction groups that belonged to that snapshot have been committed. So although you're losing the most recent few minutes of data, you can rest assured you haven't got file corruption in any of the existing files. But the actual fact is that there is *NO* difference between the last uberblock and an uberblock named as snapshot-such-and-so. All changes made after the uberblock was written are discarded by rolling back. All the transaction groups referenced by last uberblock *are* written to disk. Disabling the ZIL makes sure that fsync() and sync() no longer work; whether you take a named snapshot or the uberblock is immaterial; your strategy will cause more data to be lost. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. We're talking about the sync for NFS exports in Linux; what do they mean with sync NFS exports? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
This approach does not solve the problem. When you do a snapshot, the txg is committed. If you wish to reduce the exposure to loss of sync data and run with ZIL disabled, then you can change the txg commit interval -- however changing the txg commit interval will not eliminate the possibility of data loss. The default commit interval is what, 30 seconds? Doesn't that guarantee that any snapshot taken more than 30 seconds ago will have been fully committed to disk?

When a system boots and it finds the snapshot, then all the data referred to by the snapshot are on disk. But the snapshot doesn't guarantee more than the last valid uberblock.

Therefore, any snapshot older than 30 seconds old is guaranteed to be consistent on disk. While anything less than 30 seconds old could possibly have some later writes committed to disk before some older writes from a few seconds before. If I'm wrong about this, please explain.

When a pointer to data is committed to disk by ZFS, then the data is also on disk. (If the pointer is reachable from the uberblock, then the data is also on disk and reachable from the uberblock.) You don't need to wait 30 seconds. If it's there, it's there.

I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file. If you rollback to a snapshot that's at least 30 seconds old, then all the writes for that snapshot are guaranteed to be committed to disk already, and in the right order. You're acknowledging the loss of some known time worth of data. But you're gaining a guarantee of internal file consistency.
I don't know what ZFS guarantees when you disable the ZIL; the one broken promise is that the data may not have been committed to stable storage when fsync() returns. I'm not sure whether there is a barrier when there is a sync()/fsync(); if that is the case, then ZFS is still safe for your application. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Mar 31, 2010, at 11:51 PM, Edward Ned Harvey solar...@nedharvey.com wrote: A MegaRAID card with write-back cache? It should also be cheaper than the F20. I haven't posted results yet, but I just finished a few weeks of extensive benchmarking various configurations. I can say this: WriteBack cache is much faster than naked disks, but if you can buy an SSD or two for ZIL log device, the dedicated ZIL is yet again much faster than WriteBack. It doesn't have to be F20. You could use the Intel X25 for example. If you're running solaris proper, you better mirror your ZIL log device. If you're running opensolaris ... I don't know if that's important. I'll probably test it, just to be sure, but I might never get around to it because I don't have a justifiable business reason to build the opensolaris machine just for this one little test. Seriously, all disks configured WriteThrough (spindle and SSD disks alike) using the dedicated ZIL SSD device, very noticeably faster than enabling the WriteBack. What do you get with both SSD ZIL and WriteBack disks enabled? I mean if you have both why not use both? Then both async and sync IO benefits. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Mar 31, 2010, at 11:58 PM, Edward Ned Harvey solar...@nedharvey.com wrote: We ran into something similar with these drives in an X4170 that turned out to be an issue of the preconfigured logical volumes on the drives. Once we made sure all of our Sun PCI HBAs were running the exact same version of firmware and recreated the volumes on new drives arriving from Sun we got back into sync on the X25-E devices sizes. Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive, and create simple volume in the storagetek raid utility, the new drive is 0.001 Gb smaller than the old drive. I'm still hosed. Are you saying I might benefit by sticking the SSD into some laptop, and zero'ing the disk? And then attach to the sun server? Are you saying I might benefit by finding some other way to make the drive available, instead of using the storagetek raid utility? I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Apr 1, 2010, at 8:42 AM, casper@sun.com wrote: Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. We're talking about the sync for NFS exports in Linux; what do they mean with sync NFS exports? See section A1 in the FAQ: http://nfs.sourceforge.net/ -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 14:49, Ross Walker wrote: We're talking about the sync for NFS exports in Linux; what do they mean with sync NFS exports? See section A1 in the FAQ: http://nfs.sourceforge.net/ I think B4 is the answer to Casper's question: BEGIN QUOTE Linux servers (although not the Solaris reference implementation) allow this requirement to be relaxed by setting a per-export option in /etc/exports. The name of this export option is [a]sync (note that there is also a client-side mount option by the same name, but it has a different function, and does not defeat NFS protocol compliance). When set to sync, Linux server behavior strictly conforms to the NFS protocol. This is default behavior in most other server implementations. When set to async, the Linux server replies to NFS clients before flushing data or metadata modifying operations to permanent storage, thus improving performance, but breaking all guarantees about server reboot recovery. END QUOTE For more info see the whole of section B4 through B6. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Thu, Apr 1, 2010 at 10:03 AM, Darren J Moffat darr...@opensolaris.org wrote: On 01/04/2010 14:49, Ross Walker wrote: We're talking about the sync for NFS exports in Linux; what do they mean with sync NFS exports? See section A1 in the FAQ: http://nfs.sourceforge.net/ I think B4 is the answer to Casper's question: BEGIN QUOTE Linux servers (although not the Solaris reference implementation) allow this requirement to be relaxed by setting a per-export option in /etc/exports. The name of this export option is [a]sync (note that there is also a client-side mount option by the same name, but it has a different function, and does not defeat NFS protocol compliance). When set to sync, Linux server behavior strictly conforms to the NFS protocol. This is default behavior in most other server implementations. When set to async, the Linux server replies to NFS clients before flushing data or metadata modifying operations to permanent storage, thus improving performance, but breaking all guarantees about server reboot recovery. END QUOTE For more info the whole of section B4 though B6. True, I was thinking more of the protocol summary. Is that what sync means in Linux? As NFS doesn't use close or fsync, what exactly are the semantics. (For NFSv2/v3 each *operation* is sync and the client needs to make sure it can continue; for NFSv4, some operations are async and the client needs to use COMMIT) Actually the COMMIT command was introduced in NFSv3. The full details: NFS Version 3 introduces the concept of safe asynchronous writes. A Version 3 client can specify that the server is allowed to reply before it has saved the requested data to disk, permitting the server to gather small NFS write operations into a single efficient disk write operation. A Version 3 client can also specify that the data must be written to disk before the server replies, just like a Version 2 write. 
The client specifies the type of write by setting the stable_how field in the arguments of each write operation to UNSTABLE to request a safe asynchronous write, and FILE_SYNC for an NFS Version 2 style write. Servers indicate whether the requested data is permanently stored by setting a corresponding field in the response to each NFS write operation. A server can respond to an UNSTABLE write request with an UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the requested data resides on permanent storage yet. An NFS protocol-compliant server must respond to a FILE_SYNC request only with a FILE_SYNC reply. Clients ensure that data that was written using a safe asynchronous write has been written onto permanent storage using a new operation available in Version 3 called a COMMIT. Servers do not send a response to a COMMIT operation until all data specified in the request has been written to permanent storage. NFS Version 3 clients must protect buffered data that has been written using a safe asynchronous write but not yet committed. If a server reboots before a client has sent an appropriate COMMIT, the server can reply to the eventual COMMIT request in a way that forces the client to resend the original write operation. Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
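The stable_how / COMMIT handshake above can be modeled in a few lines. This is a hedged toy model, not a real NFS implementation; the class and field names are invented, but the reply rules follow the protocol summary quoted above:

```python
# Toy model of NFSv3 write stability semantics (NOT a real NFS server).

UNSTABLE, FILE_SYNC = "UNSTABLE", "FILE_SYNC"

class ToyNfsServer:
    def __init__(self):
        self.stable = []    # data on permanent storage
        self.pending = []   # acknowledged but not yet durable

    def write(self, data, stable_how):
        if stable_how == FILE_SYNC:
            # A protocol-compliant server must only reply FILE_SYNC
            # once the data is on permanent storage.
            self.stable.append(data)
            return FILE_SYNC
        # Safe asynchronous write: reply before the data is durable,
        # so small writes can be gathered into one efficient disk write.
        self.pending.append(data)
        return UNSTABLE

    def commit(self):
        # No reply to COMMIT until all pending data is on permanent
        # storage.
        self.stable.extend(self.pending)
        self.pending = []
        return "COMMITTED"

srv = ToyNfsServer()
srv.write(b"a", FILE_SYNC)   # NFSv2-style synchronous write
srv.write(b"b", UNSTABLE)    # safe asynchronous write, acked early
srv.commit()                 # client flush on close(2)/fsync(2)
print(srv.stable)            # [b'a', b'b']
```

The client-side obligation is the mirror image: it must keep its own copy of UNSTABLE-written data until the COMMIT succeeds, so it can resend if the server rebooted in between.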
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Thu, 1 Apr 2010, Edward Ned Harvey wrote: If I'm wrong about this, please explain. I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file. Zfs writes data in transaction groups and each bunch of data which gets written is bounded by a transaction group. The current state of the data at the time the TXG starts will be the state of the data once the TXG completes. If the system spontaneously reboots then it will restart at the last completed TXG so any residual writes which might have occurred while a TXG write was in progress will be discarded. Based on this, I think that your ordering concerns (sync writes getting to disk faster than async writes) are unfounded for normal file I/O. However, if file I/O is done via memory mapped files, then changed memory pages will not necessarily be written. The changes will not be known to ZFS until the kernel decides that a dirty page should be written or there is a conflicting traditional I/O which would update the same file data. Use of msync(3C) is necessary to assure that file data updated via mmap() will be seen by ZFS and committed to disk in an orderly fashion. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
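Python's mmap module exposes msync() as flush(), so the mmap/msync point is easy to demonstrate. A minimal sketch, assuming a Unix-like system; the file path is made up:

```python
# Dirty a memory-mapped page, then msync() it so the filesystem is
# guaranteed to see the change.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mapped")
with open(path, "wb") as f:
    f.write(b"\0" * 4096)     # the file must exist at full size first

with open(path, "r+b") as f:
    mem = mmap.mmap(f.fileno(), 4096)
    mem[:5] = b"hello"        # dirties a page; the kernel may defer
                              # writeback indefinitely without msync
    mem.flush()               # msync(): push the dirty page to the fs
    mem.close()

with open(path, "rb") as f:
    print(f.read(5))          # b'hello'
```

Without the flush(), the change lives only in the page cache until the kernel decides to write it, which is exactly the window Bob describes.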
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 13:01, Edward Ned Harvey wrote: Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. ROTFL!!! I think you should explain it even further for Casper :) :) :) :) :) :) :) -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 13:01, Edward Ned Harvey wrote: Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. ROTFL!!! I think you should explain it even further for Casper :) :) :) :) :) :) :) :-) So what I *really* wanted to know was what sync meant for the NFS server in the case of Linux. Apparently it means implement the NFS protocol to the letter. I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Thu, 1 Apr 2010, Edward Ned Harvey wrote: Dude, don't be so arrogant. Acting like you know what I'm talking about better than I do. Face it that you have something to learn here. Geez! Yes, all the transactions in a transaction group are either committed entirely to disk, or not at all. But they're not necessarily committed to disk in the same order that the user level applications requested. Meaning: If I have an application that writes to disk in sync mode intentionally ... perhaps because my internal file format consistency would be corrupt if I wrote out-of-order ... If the sysadmin has disabled ZIL, my sync write will not block, and I will happily issue more write operations. As long as the OS remains operational, no problem. The OS keeps the filesystem consistent in RAM, and correctly manages all the open file handles. But if the OS dies for some reason, some of my later writes may have been committed to disk while some of my earlier writes could be lost, which were still being buffered in system RAM for a later transaction group. The purpose of the ZIL is to act like a fast log for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work. As you say, OS keeps the filesystem consistent in RAM. There is no 1:1 ordering between application write requests and zfs writes and in fact, if the same portion of file is updated many times, or the file is created/deleted many times, zfs only writes the updated data which is current when the next TXG is written. For a synchronous write, zfs advances its index in the slog once the corresponding data has been committed in a TXG. In other words, the sync and async write paths are the same when it comes to writing final data to disk. There is however the recovery case where synchronous writes were affirmed which were not yet written in a TXG and the system spontaneously reboots. 
In this case the synchronous writes will occur based on the slog, and uncommitted async writes will have been lost. Perhaps this is the case you are worried about. It does seem like rollback to a snapshot does help here (to assure that sync and async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
It does seem like rollback to a snapshot does help here (to assure that sync and async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times. But doesn't that snapshot possibly have the same issues? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Thu, 1 Apr 2010, casper@sun.com wrote: It does seem like rollback to a snapshot does help here (to assure that sync async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times. But doesn't that snapshot possibly have the same issues? No, at least not based on my understanding. My understanding is that zfs uses uniform prioritization of updates and performs writes in order (at least to the level of a TXG). If this is true, then each normal TXG will be a coherent representation of the filesystem. If the slog is used to recover uncommitted writes, then the TXG based on that may not match the in-memory filesystem at the time of the crash since async writes may have been lost. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
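Bob's point about TXG atomicity can be sketched as a toy model. This is not ZFS code; the names are invented. It shows why a pool imported after a crash is exactly the state at the last completed TXG, never a partially applied one:

```python
# Toy model (NOT ZFS) of atomic transaction-group commits.

class ToyPool:
    def __init__(self):
        self.committed = {}   # state reachable from the last uberblock
        self.open_txg = {}    # changes accumulating in RAM

    def write(self, name, data):
        self.open_txg[name] = data

    def commit_txg(self):
        # All-or-nothing: the new uberblock points at the new tree.
        self.committed = {**self.committed, **self.open_txg}
        self.open_txg = {}

    def crash_and_import(self):
        # The open TXG never reached disk; it is dropped entirely.
        self.open_txg = {}
        return dict(self.committed)

pool = ToyPool()
pool.write("a", 1)
pool.write("b", 2)
pool.commit_txg()          # coherent state {a:1, b:2} on disk
pool.write("a", 99)        # lands in the next, still-open TXG
state = pool.crash_and_import()
print(state)               # {'a': 1, 'b': 2}
```

Slog replay is the one wrinkle this model omits: recovered sync writes are applied on top of the last TXG, which is where the "sync survived, async lost" mismatch in this subthread comes from.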
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hello, I have had this problem this week. Our ZIL SSD died (apt SLC SSD 16GB). Because we had no spare drive in stock, we ignored it. Then we decided to update our Nexenta 3 alpha to beta, exported the pool and made a fresh install to have a clean system, and tried to import the pool. We only got an error message about a missing drive. We googled about this and it seems there is no way to access the pool!!! (hope this will be fixed in the future) We had a backup and the data are not so important, but that could be a real problem: you have a valid ZFS pool and you cannot access your data due to a missing ZIL. gea www.napp-it.org zfs server -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Casper, :-) Nice to see that your stream still reaches as far as ever :-) I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Interesting thing is that apparently defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export... Anyway we seem to be getting off topic here :-) The thread was started to get insight into the behaviour of the F20 as ZIL. _My_ particular interest would be to be able to answer why performance doesn't seem to scale up when adding vmods... With kind regards, Jeroen -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Jeroen Roodhart wrote: The thread was started to get insight in behaviour of the F20 as ZIL. _My_ particular interest would be to be able to answer why perfomance doesn't seem to scale up when adding vmod-s... My best guess would be latency. If you are latency bound, adding additional parallel devices with the same latency will make no difference. It will improve throughput, but may actually make latency worse (additional time to select which parallel device to use). But one of the ZFS gurus may be able to provide a better answer, or some dtrace foo to confirm/deny my thesis. -- Carson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
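Carson's latency thesis has a simple back-of-the-envelope form: a single synchronous writer issues one log write at a time, so its speed is set entirely by per-write latency, while extra parallel devices only raise aggregate throughput. The latency figure below is an invented, illustrative number, not a measured F20 value:

```python
# Why single-stream sync performance is latency-bound: one write must
# complete before the next is issued, regardless of device count.

def single_stream_iops(latency_ms):
    # One outstanding write at a time.
    return 1000.0 / latency_ms

def aggregate_iops(latency_ms, devices, queue_depth_per_dev):
    # Many independent writers can keep all devices busy.
    return devices * queue_depth_per_dev * 1000.0 / latency_ms

lat = 0.3   # assumed ~300 us per log write (illustrative only)
print(single_stream_iops(lat))    # ~3333, no matter how many vmods
print(aggregate_iops(lat, 4, 8))  # ~106667, scales with parallelism
```

So a lone tar-extraction workload sees no benefit from added vmods, while hundreds of concurrent clients would.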
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
It doesn't have to be F20. You could use the Intel X25 for example. The MLC-based disks are bound to be too slow (we tested with an OCZ Vertex Turbo). So you're stuck with the X25-E (which Sun stopped supporting for some reason). I believe most normal SSDs do have some sort of cache and usually no supercap or other backup power solution. So be wary of that. Having said all this, the new Sandforce based SSDs look promising... If you're running solaris proper, you better mirror your ZIL log device. Absolutely true, I forgot this 'cause we're running OSOL nv130... (we constantly seem to need features that haven't landed in Solaris proper :) ). If you're running opensolaris ... I don't know if that's important. At least I can confirm the ability of adding and removing ZIL devices on the fly with OSOL of a sufficiently recent build. I'll probably test it, just to be sure, but I might never get around to it because I don't have a justifiable business reason to build the opensolaris machine just for this one little test. I plan to get to test this as well, won't be until late next week though. With kind regards, Jeroen -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
enh == Edward Ned Harvey solar...@nedharvey.com writes: enh Dude, don't be so arrogant. Acting like you know what I'm enh talking about better than I do. Face it that you have enh something to learn here. funny! AIUI you are wrong and Casper is right. ZFS recovers to a crash-consistent state, even without the slog, meaning it recovers to some state through which the filesystem passed in the seconds leading up to the crash. This isn't what UFS or XFS do. The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state (a state not equal to a snapshot you might have hypothetically taken in the seconds leading up to the crash), because files that were recently fsync()'d may be of newer versions than files that weren't---that is, fsync() durably commits only the file it references, by copying that *part* of the in-RAM ZIL to the durable slog. fsync() is not equivalent to 'lockfs -fa' committing every file on the system (is it?). I guess I could be wrong about that. If I'm right, this isn't a bad thing because apps that call fsync() are supposed to expect the inconsistency, but it's still important to understanding what's going on. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 20:58, Jeroen Roodhart wrote: I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Interesting thing is that apparently defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export... Which is to be expected, as it is not the NFS client which requests the behavior but rather the NFS server. Currently on Linux you can export a share as either sync (default) or async, while on Solaris you can't really force the NFS server to start working in async mode. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD goes bad, you lose your whole pool. Or at least suffer data corruption. Hmmm, I thought that in that case ZFS reverts to the regular on-disk ZIL? With kind regards, Jeroen -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
The write cache is _not_ being disabled. The write cache is being marked as non-volatile. Of course you're right :) Please filter my postings with a sed 's/write cache/write cache flush/g' ;) BTW, why is a Sun/Oracle branded product not properly respecting the NV bit in the cache flush command? This seems remarkably broken, and leads to the amazingly bad advice given on the wiki referenced above. I suspect it has something to do with emulating disk semantics over PCIE. Anyway, this did get us stumped in the beginning, performance wasn't better than when using an OCZ Vertex Turbo ;) By the way, the URL to the reference is part of the official F20 product documentation (that's how we found it in the first place)... With kind regards, Jeroen -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> I stand corrected. You don't lose your pool. You don't have a corrupted filesystem. But you lose whatever writes were not yet completed, so if those writes happen to be things like database transactions, you could have corrupted databases or files, or missing files if you were creating them at the time, and stuff like that. AKA data corruption. But not pool corruption, and not filesystem corruption.

Yeah, that's a big difference! :) Of course we could not live with pool or filesystem corruption. However, we can live with the fact that NFS-written data may not all be on disk after a server crash, even though the NFS client could rely on the write guarantees of the NFS protocol. I.e. we do not use it for database transactions or anything like that.

-- This message posted from opensolaris.org
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Adam,

> Very interesting data. Your test is inherently single-threaded so I'm not surprised that the benefits aren't more impressive -- the flash modules on the F20 card are optimized more for concurrent IOPS than single-threaded latency.

Thanks for your reply. I'll probably test the multiple-writer case, too. But frankly, at the moment I care most about the single-threaded case, because if we put e.g. user homes on this server I think the users would be severely disappointed if they had to wait 2m42s just to extract a rather small 50 MB tarball. The default 7m40s without an SSD log was unacceptable, and we were hoping that the F20 would make a big difference and bring the runtimes down to acceptable levels. But IMHO 2m42s is still too slow, and disabling the ZIL seems to be the only option. Knowing that hundreds of users could do this in parallel with good performance is nice, but it does not improve the situation for the single user who only cares about his own tar run. If there's anything else we can do/try to improve the single-threaded case, I'm all ears.
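For anyone wanting to reproduce the comparison, a minimal version of the single-threaded test looks something like this (file count, sizes, and paths are arbitrary stand-ins; run the extraction step on the NFS mount under test):

```shell
# Build a small tarball of many files, then time its extraction.
# On a sync NFS mount, the per-file metadata operations dominate the runtime.
set -e
mkdir -p /tmp/ziltest/src /tmp/ziltest/dst
i=1
while [ $i -le 50 ]; do
    head -c 10240 /dev/urandom > /tmp/ziltest/src/file$i
    i=$((i + 1))
done
tar cf /tmp/ziltest/sample.tar -C /tmp/ziltest src
time tar xf /tmp/ziltest/sample.tar -C /tmp/ziltest/dst
ls /tmp/ziltest/dst/src | wc -l    # expect 50 extracted files
```

Comparing the wall-clock time of the same extraction on the local pool versus the NFS mount isolates the sync-write overhead the thread is discussing.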
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss <k.we...@science-computing.de> wrote:
> Thanks for your reply. I'll probably test the multiple-writer case, too. But frankly at the moment I care the most about the single-threaded case [...] If there's anything else we can do/try to improve the single-threaded case I'm all ears.

Use something other than (Open)Solaris with ZFS as an NFS server? :)

I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL. You'd be better off getting a NetApp.

--
Brent Jones
br...@servuhome.net
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Brent Jones wrote:
> I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL.

A few days ago I posted to nfs-discuss with a proposal to add some mount/share options to change the semantics of an NFS-mounted filesystem so that they parallel those of a local filesystem. The main point is that data gets flushed to stable storage only if the client explicitly requests it via fsync or O_DSYNC, not implicitly with every close(). That would give you the performance you are seeking without sacrificing data integrity for applications that need it. I get the impression that I'm not the only one who could be interested in that ;)

-Arne

> You'd be better off getting NetApp
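On a Linux client, the distinction Arne describes can be demonstrated with GNU dd, which exposes O_DSYNC through its oflag option (file names here are arbitrary examples):

```shell
# Buffered write: write(2) returns as soon as the page cache accepts the data.
dd if=/dev/zero of=/tmp/buffered.dat bs=4k count=8 2>/dev/null

# O_DSYNC write: each write(2) returns only once the data is on stable
# storage -- the semantics Arne proposes reserving for callers who ask for it.
dd if=/dev/zero of=/tmp/stable.dat bs=4k count=8 oflag=dsync 2>/dev/null

ls -l /tmp/buffered.dat /tmp/stable.dat
```

Both files end up identical; the difference is only in when the client considers each write durable, which is exactly the knob the proposed mount/share options would control.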
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until

Since snv_125 you can remove log devices. See http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

I've used this all the time during my testing and was able to remove both mirrored and unmirrored log devices without any problems (and without a reboot). I'm using snv_134.

-- This message posted from opensolaris.org
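The commands involved look like this (pool and device names are examples; the exact vdev name to pass to 'zpool remove' for a mirrored log is whatever 'zpool status' shows):

  # Add a single log device, or a mirrored pair:
  zpool add mypool log c4t0d0
  zpool add mypool log mirror c4t0d0 c4t1d0

  # Since snv_125 (CR 6574286) a log can be removed again, online:
  zpool remove mypool c4t0d0        # unmirrored slog
  zpool remove mypool mirror-1      # mirrored slog, name per 'zpool status'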
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Use something other than (Open)Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL.

Well, for lots of environments disabling the ZIL is perfectly acceptable. And frankly, the reason you get better out-of-the-box performance on Linux as an NFS server is that it actually behaves as if the ZIL were disabled - so disabling the ZIL on ZFS for NFS shares is no worse than using Linux here, or any other OS which behaves in the same manner. Actually it makes things better: even with the ZIL disabled, a ZFS filesystem is always consistent on disk, and you still get all the other benefits of ZFS.

What would be useful, though, is to be able to easily disable the ZIL per dataset instead of with an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. It should land sooner rather than later.

> You'd be better off getting NetApp

Well, spend some extra money on a really fast NVRAM solution for the ZIL and you will get a much faster ZFS environment than NetApp, while still spending much less money. Not to mention all the extra flexibility compared to NetApp.

--
Robert Milkowski
http://milek.blogspot.com
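For reference, the OS-wide switch being discussed is the zil_disable tunable (the per-dataset sync property came later). A sketch, assuming an OpenSolaris build of that era:

  # /etc/system -- disables the ZIL pool-wide on next boot:
  set zfs:zil_disable = 1

  # Or at runtime via mdb; takes effect for filesystems mounted afterwards:
  echo zil_disable/W0t1 | mdb -kw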
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>> Just to make sure you know ... if you disable the ZIL altogether, and you have a power interruption, failed CPU, or kernel halt, then you're likely to have a corrupt unusable zpool, or at least data corruption. If that is indeed acceptable to you, go nuts. ;-)
>
> I believe that the above is wrong information, as long as the devices involved do flush their caches when requested to. ZFS still writes data in order (at the TXG level) and advances to the next transaction group when the devices written to affirm that they have flushed their caches. Without the ZIL, data claimed to be synchronously written since the previous transaction group may be entirely lost. If the devices don't flush their caches appropriately, the ZIL is irrelevant to pool corruption.

> I stand corrected. You don't lose your pool. You don't have a corrupted filesystem. But you lose whatever writes were not yet completed, so if those writes happen to be things like database transactions, you could have corrupted databases or files, or missing files if you were creating them at the time, and stuff like that. AKA data corruption. But not pool corruption, and not filesystem corruption.

Which is the expected behavior when you break the NFS requirements, as Linux does out of the box. Disabling the ZIL on an NFS server makes it no worse than the standard Linux behaviour - you get decent performance at the cost of some data possibly being corrupted from the NFS client's point of view. But there are environments where this is perfectly acceptable: ones where you are not running critical databases but rather, say, user home directories. ZFS currently flushes a transaction group after at most 30 seconds, so a user can't lose more than the last ~30s of writes if the NFS server were to suddenly lose power.

To clarify - disabling the ZIL makes no difference at all to pool- or filesystem-level consistency.
--
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>> standard ZIL:         7m40s  (ZFS default)
>> 1x SSD ZIL:           4m07s  (Flash Accelerator F20)
>> 2x SSD ZIL:           2m42s  (Flash Accelerator F20)
>> 2x SSD mirrored ZIL:  3m59s  (Flash Accelerator F20)
>> 3x SSD ZIL:           2m47s  (Flash Accelerator F20)
>> 4x SSD ZIL:           2m57s  (Flash Accelerator F20)
>> disabled ZIL:         0m15s  (local extraction: 0m0.269s)
>>
>> I was not so much interested in the absolute numbers but rather in the relative performance differences between the standard ZIL, the SSD ZIL and the disabled ZIL cases.
>
> Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD goes bad, you lose your whole pool. Or at least suffer data corruption.

This is not true. If the ZIL device dies while the pool is imported, ZFS will start using the ZIL within the pool and continue to operate. On the other hand, if your server suddenly loses power, and when you power it up later ZFS detects that the ZIL device is broken/gone, it will require sysadmin intervention to force the pool import - and yes, you may possibly lose some data. But how is that different from any other solution where your log is put on a separate device? Well, it actually is different: with ZFS you can still guarantee that the pool is consistent on disk, while others generally can't, and you will often have to run fsck just to mount the filesystem read/write...

--
Robert Milkowski
http://milek.blogspot.com
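The sysadmin intervention Robert refers to is the sequence from the ZFS-8000-K4 action text quoted earlier in the thread (pool name as in that example; the log device name is a placeholder):

  # Pool is FAULTED with "bad intent log" after power loss. Either restore
  # the affected device and bring it back online:
  zpool online mypool <log-device>

  # ...or accept the loss and discard the unreadable intent-log records:
  zpool clear mypool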
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Jeroen, Adam!

> Switched write caching off with the following addition to the /kernel/drv/sd.conf file (Karsten: if you didn't do this already, you _really_ want to :)

Okay, I bite! :) format->inquiry on the F20 FMod disks returns:

  Vendor:  ATA
  Product: MARVELL SD88SA02

So I put this in /kernel/drv/sd.conf and rebooted:

  # KAW, 2010-03-31
  # Set F20 FMod devices to non-volatile mode
  # See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
  sd-config-list = "ATA     MARVELL SD88SA02", "nvcache1";
  nvcache1 = 1, 0x40000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1;

Now the tarball extraction test with an active ZIL finishes in ~0m32s! I've tested with a mirrored SSD log and with two separate SSD log devices; the runtime is nearly the same. Compared to the 2m42s before the /kernel/drv/sd.conf modification this is a huge improvement. The performance with an active ZIL would be acceptable now. But is this mode of operation *really* safe?

FWIW, zilstat during the test shows this:

   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate  ops  <=4kB  4-32kB  >=32kB
         0          0           0          0          0           0    0      0       0       0
   1039072    1039072     1039072    3772416    3772416     3772416  610    299     311       0
   1522496    1522496     1522496    5402624    5402624     5402624  874    429     445       0
   2292952    2292952     2292952    6746112    6746112     6746112  931    215     716       0
   2321272    2321272     2321272    6774784    6774784     6774784  931    208     723       0
   2303472    2303472     2303472    6549504    6549504     6549504  897    195     702       0
       632        632         632    6733824    6733824     6733824  935    226     709       0
   2198328    2198328     2198328    6668288    6668288     6668288  926    224     702       0
       217        217         217    6373376    6373376     6373376  878    200     678       0
   2185416    2185416     2185416    6352896    6352896     6352896  874    197     677       0
   2218040    2218040     2218040    6516736    6516736     6516736  897    203     694       0
   2436984    2436984     2436984    6549504    6549504     6549504  885    171     714       0

I.e. ~900 ops/s.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL. You'd be better off getting NetApp

Hah hah. I have a Sun X4275 server exporting NFS. We have clients on all 4 of the Gb ethers, and the Gb ethers are the bottleneck, not the disks or the filesystem. I suggest you either enable the write-back cache on your HBA or add SSDs for the ZIL. Performance is 5-10x higher this way than with naked disks. But of course not as high as with a disabled ZIL.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Karsten,

> But is this mode of operation *really* safe?

As far as I can tell it is:

- The F20 uses some form of power backup that should keep the interface card powered long enough to get the cache onto solid state in case of a power failure.
- Recollecting from earlier threads here: in case the card fails (but not the host), there should be enough data residing in memory for ZFS to safely switch to the regular on-disk ZIL.
- According to my contacts at Sun, the F20 is a viable replacement for the X25-E.
- Switching write caching off seems to be officially recommended on the Sun performance wiki (translated to more sane defaults).

If I'm wrong here I'd like to know too, 'cause this is probably the way we're taking it into production :)

With kind regards,
Jeroen