Re: [zfs-discuss] hard drive write cache
ZFS enables the disk write cache and flushes it when committing transaction groups; this ensures that a transaction group appears on disk in its entirety or not at all. ZFS also flushes the disk write cache before returning from every synchronous request (e.g. fsync, O_DSYNC). This is done after writing out the intent log blocks. Neil ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Well, this does look more and more like a duplicate of: 6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS. Neil
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Torrey McMahon wrote On 06/21/06 10:29,:

Roch wrote: Sean Meighan writes: The vi we were doing was a 2 line file. If you just vi a new file, add one line and exit, it would take 15 minutes in fdsync. On recommendation of a workaround we set zfs:zil_disable=1; after the reboot the fdsync is now 0.1 seconds. Now I have no idea if it was this setting or the fact that we went through a reboot. Whatever the root cause, we are now back to a well behaved file system.

Well behaved... in appearance only! Maybe it's nice to validate the hypothesis, but you should not run with this option set, ever; it disables O_DSYNC and fsync() and I don't know what else. Bad idea, bad.

Why is this option available then? (Yes, that's a loaded question.)

I wouldn't call it an option, but an internal debugging switch that I originally added to allow progress when initially integrating the ZIL. As Roch says, it really shouldn't ever be set (as it negates POSIX synchronous semantics). Nor should it be mentioned to a customer. In fact I'm inclined to now remove it - however it does still have a use, as it helped root cause this problem. Neil
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Robert Milkowski wrote On 06/21/06 11:09,: Hello Neil,

Why is this option available then? (Yes, that's a loaded question.)

NP I wouldn't call it an option, but an internal debugging switch that I
NP originally added to allow progress when initially integrating the ZIL.
NP As Roch says it really shouldn't ever be set (as it negates POSIX
NP synchronous semantics). Nor should it be mentioned to a customer.
NP In fact I'm inclined to now remove it - however it does still have a use
NP as it helped root cause this problem.

Isn't it similar to the unsupported fastfs for ufs?

It is similar in the sense that it speeds up the file system. Using fastfs can be much more dangerous though, as it can lead to a badly corrupted file system: writing of metadata is delayed and done out of order. Whereas disabling the ZIL does not affect the integrity of the fs. The transaction group model of ZFS gives consistency in the event of a crash/power failure. However, any data that was promised to be on stable storage may not be there unless the transaction group committed (an operation that is started every 5s).

We once had plans to add a mount option to allow the admin to control the ZIL. Here's a brief section of the RFE (6280630):

sync={deferred,standard,forced}
Controls synchronous semantics for the dataset.

When set to 'standard' (the default), synchronous operations such as fsync(3C) behave precisely as defined in fcntl.h(3HEAD).

When set to 'deferred', requests for synchronous semantics are ignored. However, ZFS still guarantees that ordering is preserved -- that is, consecutive operations reach stable storage in order. (If a thread performs operation A followed by operation B, then the moment that B reaches stable storage, A is guaranteed to be on stable storage as well.) ZFS also guarantees that all operations will be scheduled for write to stable storage within a few seconds, so that an unexpected power loss only takes the last few seconds of change with it.

When set to 'forced', all operations become synchronous. No operation will return until all previous operations have been committed to stable storage. This option can be useful if an application is found to depend on synchronous semantics without actually requesting them; otherwise, it will just make everything slow, and is not recommended.

Of course we would need to stress the dangers of setting 'deferred'. What do you guys think? Neil.
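Had RFE 6280630 been integrated, the property would presumably be managed like any other dataset property. The commands below are a hypothetical sketch - the value names come from the RFE text, not from a shipping release, and the pool/dataset names are invented:

```shell
# Hypothetical usage of the proposed per-dataset sync property (RFE 6280630).
zfs set sync=standard tank/home      # default: full POSIX synchronous semantics
zfs set sync=deferred tank/scratch   # ignore sync requests; ordering still preserved (dangerous)
zfs set sync=forced   tank/dbdumps   # every operation becomes synchronous (slow)
zfs get sync tank/scratch            # inspect the current setting
```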
Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS
Chris, The data will be written twice on ZFS using NFS. This is because NFS, on closing the file, internally uses fsync to cause the writes to be committed. This causes the ZIL to immediately write the data to the intent log. Later the data is also committed as part of the pool's transaction group commit, at which point the intent log blocks are freed. It does seem inefficient to doubly write the data. In fact, for blocks larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was fixed) we write the data block directly and also an intent log record with the block pointer. During txg commit we link this block into the pool tree. By experimentation we found 32K to be the (current) cutoff point. As the nfsds write at most 32K, they do not benefit from this. Anyway, this is an area we are actively working on. Neil.

Chris Csanady wrote On 06/23/06 23:45,: While dd'ing to an nfs filesystem, half of the bandwidth is unaccounted for. What dd reports amounts to almost exactly half of what zpool iostat or iostat show, even after accounting for the overhead of the two mirrored vdevs. Would anyone care to guess where it may be going? (This is measured over 10 second intervals. For 1 second intervals, the bandwidth to the disks jumps around from 40MB/s to 240MB/s.) With a local dd, everything adds up. This is with a b41 server, and a MacOS 10.4 nfs client. I have verified that the bandwidth at the network interface is approximately that reported by dd, so the issue would appear to be within the server. Any suggestions would be welcome. Chris
Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS
Robert Milkowski wrote On 06/25/06 04:12,: Hello Neil, Saturday, June 24, 2006, 3:46:34 PM, you wrote:

NP Chris, The data will be written twice on ZFS using NFS. This is because NFS,
NP on closing the file, internally uses fsync to cause the writes to be
NP committed. This causes the ZIL to immediately write the data to the intent log.
NP Later the data is also committed as part of the pool's transaction group
NP commit, at which point the intent log blocks are freed.
NP It does seem inefficient to doubly write the data. In fact, for blocks
NP larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was fixed)
NP we write the data block directly and also an intent log record with the block pointer.
NP During txg commit we link this block into the pool tree. By experimentation
NP we found 32K to be the (current) cutoff point. As the nfsds write at most 32K,
NP they do not benefit from this.

Is 32KB easily tuned (mdb?)?

I'm not sure. NFS folk? I guess not, but perhaps.

And why only for blocks larger than zfs_immediate_write_sz?

When data is large enough (currently 32K) it's more efficient to directly write the block, and additionally save the block pointer in a ZIL record. Otherwise it's more efficient to copy the data into a large log block, potentially along with other writes. -- Neil
Re: [zfs-discuss] Bandwidth disparity between NFS and ZFS
Robert Milkowski wrote On 06/27/06 03:00,: Hello Chris, Tuesday, June 27, 2006, 1:07:31 AM, you wrote:

CC On 6/26/06, Neil Perrin [EMAIL PROTECTED] wrote: Robert Milkowski wrote On 06/25/06 04:12,: Hello Neil, Saturday, June 24, 2006, 3:46:34 PM, you wrote: NP Chris, The data will be written twice on ZFS using NFS. This is because NFS, on closing the file, internally uses fsync to cause the writes to be committed. This causes the ZIL to immediately write the data to the intent log. NP Later the data is also committed as part of the pool's transaction group commit, at which point the intent log blocks are freed. NP It does seem inefficient to doubly write the data. In fact, for blocks larger than zfs_immediate_write_sz (was 64K, but now 32K after 6440499 was fixed) we write the data block directly and also an intent log record with the block pointer. NP During txg commit we link this block into the pool tree. By experimentation we found 32K to be the (current) cutoff point. As the nfsds write at most 32K, they do not benefit from this. Is 32KB easily tuned (mdb?)? I'm not sure. NFS folk?

CC I think he is referring to the zfs_immediate_write_sz variable, but

Exactly, I was asking about this, not NFS.

Sorry for the confusion. The zfs_immediate_write_sz variable was meant for internal use and not really intended for public tuning. However, yes, it could be tuned dynamically anytime using mdb, or set in /etc/system. -- Neil
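As a sketch of the two methods Neil mentions (the variable name appears in the thread; the value 16384 is only an example, and the mdb write format is an assumption - on a 64-bit kernel the variable is an ssize_t, hence the 8-byte /Z write):

```shell
# Dynamically, on the live kernel, with the modular debugger:
echo 'zfs_immediate_write_sz/Z 0t16384' | mdb -kw

# Persistently, by adding a line to /etc/system and rebooting:
#   set zfs:zfs_immediate_write_sz = 16384

# Read back the current value:
echo 'zfs_immediate_write_sz/E' | mdb -k
```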
Re: [zfs-discuss] Supporting ~10K users on ZFS
[EMAIL PROTECTED] wrote On 06/27/06 17:17,: We have over 1 filesystems under /home in strongspace.com and it works fine. I forget, but there was a bug, or there was an improvement made around nevada build 32 (we're currently at 41), that made the initial mount on reboot significantly faster. Before that it was around 10-15 minutes. I wonder if that improvement didn't make it into sol10U2?

That fix (bug 6377670) made it into build 34 and S10_U2.

-Jason Sent via BlackBerry from Cingular Wireless

-Original Message- From: eric kustarz [EMAIL PROTECTED] Date: Tue, 27 Jun 2006 15:55:45 To: Steve Bennett [EMAIL PROTECTED] Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Supporting ~10K users on ZFS

Steve Bennett wrote: OK, I know that there's been some discussion on this before, but I'm not sure that any specific advice came out of it. What would the advice be for supporting a largish number of users (10,000 say) on a system that supports ZFS? We currently use vxfs and assign a user quota, and backups are done via Legato Networker.

Using lots of filesystems is definitely encouraged - as long as doing so makes sense in your environment.

From what little I currently understand, the general advice would seem to be to assign a filesystem to each user, and to set a quota on that. I can see this being OK for small numbers of users (up to 1000 maybe), but I can also see it being a bit tedious for larger numbers than that. I just tried a quick test on Sol10u2:

for x in 0 1 2 3 4 5 6 7 8 9; do
  for y in 0 1 2 3 4 5 6 7 8 9; do
    zfs create testpool/$x$y
    zfs set quota=1024k testpool/$x$y
  done
done

[apologies for the formatting - is there any way to preformat text on this forum?]
It ran OK for a minute or so, but then I got a slew of errors: cannot mount '/testpool/38': unable to create mountpoint filesystem successfully created, but not mounted So, OOTB there's a limit that I need to raise to support more than approx 40 filesystems (I know that this limit can be raised, I've not checked to see exactly what I need to fix). It does beg the question of why there's a limit like this when ZFS is encouraging use of large numbers of filesystems.

There is no 40 filesystem limit. You most likely had a pre-existing file/directory in testpool with the same name as the filesystem you tried to create.

fsh-hake# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
testpool   77K   7.81G  24.5K  /testpool
fsh-hake# echo hmm > /testpool/01
fsh-hake# zfs create testpool/01
cannot mount 'testpool/01': Not a directory
filesystem successfully created, but not mounted
fsh-hake#

If I have 10,000 filesystems, is the mount time going to be a problem? I tried:

for x in 0 1 2 3 4 5 6 7 8 9; do
  for y in 0 1 2 3 4 5 6 7 8 9; do
    zfs umount testpool/001
    zfs mount testpool/001
  done
done

This took 12 seconds, which is OK until you scale it up - even if we assume that mount and unmount take the same amount of time, so 100 mounts will take 6 seconds, this means that 10,000 mounts will take 10 minutes. Admittedly, this is on a test system without fantastic performance, but there *will* be a much larger delay on mounting a ZFS pool like this over a comparable UFS filesystem.

So this really depends on why and when you're unmounting filesystems. I suspect it won't matter much since you won't be unmounting/remounting your filesystems.

I currently use Legato Networker, which (not unreasonably) backs up each filesystem as a separate session - if I continue to use this I'm going to have 10,000 backup sessions on each tape backup. I'm not sure what kind of challenges restoring this kind of beast will present.
Others have already been through the problems with standard tools such as 'df' becoming less useful.

Is there a specific problem you had in mind regarding 'df'?

One alternative is to ditch quotas altogether - but even though disk is cheap, it's not free, and regular backups take time (and tapes are not free either!). In any case, 10,000 undergraduates really will be able to fill more disks than we can afford to provision. We tried running a Windows fileserver back in the days when it had no support for per-user quotas; we did some ad-hockery that helped to keep track of the worst offenders (albeit after the event), but what really killed us was the uncertainty over whether some idiot would decide to fill all available space with vital research data (or junk, depending on your point of view). I can see the huge benefits that ZFS quotas and reservations can bring, but I can also see that there are situations where ZFS could be useful, but the lack of 'legacy' user-based quotas makes it impractical. If the ZFS developers really are not going to implement user quotas, is there any advice on what someone like me could do - at the moment I'm presuming that I'll just have to leave ZFS
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Robert Milkowski wrote On 06/28/06 15:52,: Hello Neil, Wednesday, June 21, 2006, 8:15:54 PM, you wrote:

NP Robert Milkowski wrote On 06/21/06 11:09,: Hello Neil, Why is this option available then? (Yes, that's a loaded question.)
NP I wouldn't call it an option, but an internal debugging switch that I
NP originally added to allow progress when initially integrating the ZIL.
NP As Roch says it really shouldn't ever be set (as it negates POSIX
NP synchronous semantics). Nor should it be mentioned to a customer.
NP In fact I'm inclined to now remove it - however it does still have a use
NP as it helped root cause this problem.

Isn't it similar to the unsupported fastfs for ufs?

NP It is similar in the sense that it speeds up the file system.
NP Using fastfs can be much more dangerous though, as it can lead
NP to a badly corrupted file system as writing of metadata is delayed
NP and done out of order. Whereas disabling the ZIL does not affect
NP the integrity of the fs. The transaction group model of ZFS gives
NP consistency in the event of a crash/power failure. However, any data that
NP was promised to be on stable storage may not be there unless the transaction
NP group committed (an operation that is started every 5s).
NP We once had plans to add a mount option to allow the admin
NP to control the ZIL. Here's a brief section of the RFE (6280630):
NP sync={deferred,standard,forced}
NP Controls synchronous semantics for the dataset.
NP When set to 'standard' (the default), synchronous operations
NP such as fsync(3C) behave precisely as defined in fcntl.h(3HEAD).
NP When set to 'deferred', requests for synchronous semantics
NP are ignored. However, ZFS still guarantees that ordering
NP is preserved -- that is, consecutive operations reach stable
NP storage in order. (If a thread performs operation A followed
NP by operation B, then the moment that B reaches stable storage,
NP A is guaranteed to be on stable storage as well.) ZFS also
NP guarantees that all operations will be scheduled for write to
NP stable storage within a few seconds, so that an unexpected
NP power loss only takes the last few seconds of change with it.
NP When set to 'forced', all operations become synchronous.
NP No operation will return until all previous operations
NP have been committed to stable storage. This option can be
NP useful if an application is found to depend on synchronous
NP semantics without actually requesting them; otherwise, it
NP will just make everything slow, and is not recommended.
NP Of course we would need to stress the dangers of setting 'deferred'.
NP What do you guys think?

I think it would be really useful. I found myself many times in situations where such features (like fastfs) were my last-resort help.

The overwhelming consensus was that it would be useful. So I'll go ahead and put that on my to do list.

The same with txg_time - in some cases tuning it could probably be useful. Instead of playing with mdb it would be much better put into zpool/zfs or another util (and if possible made per fs, not per host).

This one I'm less sure about. I have certainly tuned txg_time myself to force certain situations, but I wouldn't be happy exposing the inner workings of ZFS - which may well change. Neil
Re: [zfs-discuss] Re: zvol Performance
This is change request: 6428639 large writes to zvol synchs too much, better cut down a little which I have a fix for, but it hasn't been put back. Neil.

Jürgen Keil wrote On 07/17/06 04:18,: Further testing revealed that it wasn't an iSCSI performance issue but a zvol issue. Testing on a SATA disk locally, I get these numbers (sequential write): UFS: 38MB/s ZFS: 38MB/s Zvol UFS: 6MB/s Zvol Raw: ~6MB/s ZFS is nice and fast but Zvol performance just drops off a cliff. Suggestions or observations by others using zvol would be extremely helpful.

# zfs create -V 1g data/zvol-test
# time dd if=/data/media/sol-10-u2-ga-x86-dvd.iso of=/dev/zvol/rdsk/data/zvol-test bs=32k count=10000
10000+0 records in
10000+0 records out
0.08u 9.37s 2:21.56 6.6%

That's ~2.3 MB/s. I do see *frequent* DKIOCFLUSHWRITECACHE ioctls (one flush write cache ioctl after writing ~36KB of data; each flush needs ~6-7 milliseconds):

0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02778, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 5736778 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e027c0, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6209599 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02808, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6572132 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02850, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6732316 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02898, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6175876 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e028e0, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6251611 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02928, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7756397 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02970, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6393356 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e029b8, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6147003 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02a00, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6247036 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02a48, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6061991 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02a90, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6284297 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02ad8, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6174818 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5e02b20, count 9000
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6245923 nsec, error 0

dtrace with stack backtraces:

0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5d1ec10, count 9000
0 39404 zio_ioctl:entry
              zfs`zil_flush_vdevs+0x144
              zfs`zil_commit+0x311
              zfs`zvol_strategy+0x4bc
              genunix`default_physio+0x308
              genunix`physio+0x1d
              zfs`zvol_write+0x22
              genunix`cdev_write+0x25
              specfs`spec_write+0x4d6
              genunix`fop_write+0x2e
              genunix`write+0x2ae
              unix`sys_sysenter+0x104
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 6638189 nsec, error 0
0 12308 bdev_strategy:entry edev 1980047, flags 1080101, bn 5d1ec58, count 9000
0 39404 zio_ioctl:entry
              zfs`zil_flush_vdevs+0x144
              zfs`zil_commit+0x311
              zfs`zvol_strategy+0x4bc
              genunix`default_physio+0x308
              genunix`physio+0x1d
              zfs`zvol_write+0x22
              genunix`cdev_write+0x25
              specfs`spec_write+0x4d6
              genunix`fop_write+0x2e
              genunix`write+0x2ae
              unix`sys_sysenter+0x104
0 38530 vdev_disk_ioctl_done:entry DKIOCFLUSHWRITECACHE time: 7881400 nsec, error 0
Re: [zfs-discuss] How to best layout our filesystems
Brian Hechinger wrote On 07/26/06 06:49,: On Tue, Jul 25, 2006 at 03:54:22PM -0700, Eric Schrock wrote: If you give zpool(1M) 'whole disks' (i.e. no 's0' slice number) and let it label and use the disks, it will automatically turn on the write cache for you. What if you can't give ZFS whole disks? I run snv_38 on the Optiplex GX620 on my desk at work and I run snv_40 on the Latitude D610 that I carry with me. In both cases the machines only have one disk, so I need to split it up for UFS for the OS and ZFS for my data. How do I turn on write cache for partial disks? -brian

You can't enable write caching for just part of the disk. We don't enable it for slices because UFS (and other file systems) doesn't do write cache flushing and so could get corruption on power failure. I suppose if you know the disk only contains zfs slices then write caching could be manually enabled using format -e -> cache -> write_cache -> enable. Neil
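Spelled out, the interactive sequence Neil refers to might look like the sketch below (c0t1d0 is a placeholder for the disk in question, and the exact menu prompts are an assumption; use with care, and only if every slice on the disk belongs to ZFS):

```shell
# format -e exposes the expert-mode 'cache' menu.
format -e c0t1d0
# then, at the prompts:
#   format> cache
#   cache> write_cache
#   write_cache> enable
#   write_cache> display     (verify the cache is now enabled)
#   write_cache> quit
```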
Re: [zfs-discuss] zil_disable
Not quite: zil_disable is inspected on file system mounts. It's also looked at dynamically on every write for zvols. Neil.

Robert Milkowski wrote On 08/07/06 10:07,: Hello zfs-discuss, Just a note to everyone experimenting with this - if you change it online it only has effect when pools are exported and then imported. ps. I didn't use it for my last posted benchmarks - with it I get about 35,000 IOPS and 0.2ms latency - but it's meaningless.
Re: [zfs-discuss] zil_disable
Robert Milkowski wrote: Hello Neil, Monday, August 7, 2006, 6:40:01 PM, you wrote: NP Not quite: zil_disable is inspected on file system mounts. I guess you're right that umount/mount will suffice - I just hadn't time to check it, and export/import worked. Anyway, is there a way for file systems to make it active without unmount/mount in current nevada?

No, sorry. Neil
Re: [zfs-discuss] zil_disable
Robert Milkowski wrote: Hello Eric, Monday, August 7, 2006, 6:29:45 PM, you wrote:

ES Robert - This isn't surprising (either the switch or the results). Our long term
ES fix for tweaking this knob is:
ES 6280630 zil synchronicity
ES Which would add 'zfs set sync' as a per-dataset option. A cut from the
ES comments (which aren't visible on opensolaris):
ES sync={deferred,standard,forced}
ES Controls synchronous semantics for the dataset.
ES When set to 'standard' (the default), synchronous
ES operations such as fsync(3C) behave precisely as defined
ES in fcntl.h(3HEAD).
ES When set to 'deferred', requests for synchronous
ES semantics are ignored. However, ZFS still guarantees
ES that ordering is preserved -- that is, consecutive
ES operations reach stable storage in order. (If a thread
ES performs operation A followed by operation B, then the
ES moment that B reaches stable storage, A is guaranteed to
ES be on stable storage as well.) ZFS also guarantees that
ES all operations will be scheduled for write to stable
ES storage within a few seconds, so that an unexpected
ES power loss only takes the last few seconds of change
ES with it.
ES When set to 'forced', all operations become synchronous.
ES No operation will return until all previous operations
ES have been committed to stable storage. This option can
ES be useful if an application is found to depend on
ES synchronous semantics without actually requesting them;
ES otherwise, it will just make everything slow, and is not
ES recommended.
ES There was a thread describing the usefulness of this (for builds where
ES all-or-nothing over a long period of time), but I can't find it.

I remember the thread. Do you know if anyone is currently working on it and when it is expected to be integrated into snv?

I'm slated to work on it after I finish up some other ZIL bugs and performance fixes. Neil
Re: [zfs-discuss] Re: ZFS RAID10
Robert Milkowski wrote: Hello Matthew, Thursday, August 10, 2006, 6:55:41 PM, you wrote: MA On Thu, Aug 10, 2006 at 06:50:45PM +0200, Robert Milkowski wrote: btw: wouldn't it be possible to write the block only once (for synchronous IO) and then just point to that block instead of copying it again? MA We actually do exactly that for larger (32k) blocks. Why such a limit (32k)?

By experimentation that was the cutoff where it was found to be more efficient. It was recently reduced from 64K with a more efficient dmu_sync() implementation. Feel free to experiment with the dynamically changeable tunable: ssize_t zfs_immediate_write_sz = 32768; -- Neil
Re: [zfs-discuss] Re: ZFS RAID10
Robert Milkowski wrote: Hello Neil, Thursday, August 10, 2006, 7:02:58 PM, you wrote:

NP Robert Milkowski wrote: Hello Matthew, Thursday, August 10, 2006, 6:55:41 PM, you wrote: MA On Thu, Aug 10, 2006 at 06:50:45PM +0200, Robert Milkowski wrote: btw: wouldn't it be possible to write the block only once (for synchronous IO) and then just point to that block instead of copying it again? MA We actually do exactly that for larger (32k) blocks. Why such a limit (32k)?
NP By experimentation that was the cutoff where it was found to be
NP more efficient. It was recently reduced from 64K with a more
NP efficient dmu_sync() implementation.
NP Feel free to experiment with the dynamically changeable tunable:
NP ssize_t zfs_immediate_write_sz = 32768;

I've just checked using dtrace on one of our production nfs servers that 90% of the time arg5 in zfs_log_write() is exactly 32768 and the rest is always smaller. With the default value of 32768 it means that for NFS servers it will always copy the data, as I've just checked in the code and there is:

245 if (len > zfs_immediate_write_sz) {

So in the nfs server case the above will never be true (with default nfs srv settings). Wouldn't the nfs server benefit from lowering zfs_immediate_write_sz to 32767?

Yes, NFS (with its default 32K max write size) would benefit if WR_INDIRECT writes (using dmu_sync()) were faster, but that wasn't the case when last benchmarked. I'm sure there are some cases currently where tuning zfs_immediate_write_sz will help certain workloads. Anyway, I think this whole area deserves more thought. If you experiment with tuning zfs_immediate_write_sz, then please share any performance data for your application/benchmark(s). Thanks: Neil
Re: [zfs-discuss] fdatasync
Myron Scott wrote: Is there any difference between fdatasync and fsync on ZFS?

No. ZFS does not log data and metadata separately; rather, it logs essentially the system call records, e.g. writes, mkdir, truncate, setattr, etc. So fdatasync and fsync are identical on ZFS.
Re: [zfs-discuss] Significant pauses during zfs writes
Yes, James is right, this is normal behaviour. Unless the writes are synchronous (O_DSYNC) or explicitly flushed (fsync()), they are batched up, written out and committed as a transaction every txg_time (5 seconds). Neil.

James C. McPherson wrote: Bob Evans wrote: Just getting my feet wet with zfs. I set up a test system (Sunblade 1000, dual channel scsi card, disk array with 14x18GB 15K RPM SCSI disks) and was trying to write a large file (10 GB) to the array to see how it performed. I configured the raid using raidz. During the write, I saw the disk access lights come on, but I noticed a peculiar behavior. The system would write to the disk, but then pause for a few seconds, then continue, then pause for a few seconds. I saw the same behavior when I made a smaller raidz using 4x36 GB scsi drives in a different enclosure. Since I'm new to zfs, and realize that I'm probably missing something, I was hoping somebody might help shed some light on my problem.

Hi Bob, I'm pretty sure that's not a problem that you're seeing, just ZFS' normal behaviour. Writes are coalesced as much as possible, so the pauses that you observed are most likely going to be the system waiting for suitable IOs to be gathered up and sent out to your storage. If you want to examine this a bit more then might I suggest the DTrace Toolkit's iosnoop utility. best regards, James C. McPherson
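One way to watch the batching Neil describes (a sketch; the pool name and sizes are placeholders): write a large file while sampling pool I/O once per second. The writes should appear as bursts roughly every txg interval rather than as a steady stream.

```shell
# Write ~1GB in the background, then sample per-second pool I/O
# for 30 seconds; expect bursts of write bandwidth every ~5 s.
dd if=/dev/zero of=/tank/bigfile bs=1024k count=1024 &
zpool iostat tank 1 30
```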
Re: [zfs-discuss] Re: 3510 HW RAID vs 3510 JBOD ZFS SOFTWARE RAID
Robert Milkowski wrote: ps. however, I'm really concerned about ZFS behavior when a pool is almost full, there are lots of write transactions to that pool, and the server is restarted forcibly or panics. I observed that file systems on that pool will each take 10-30 minutes to mount during zfs mount -a, and one CPU is completely consumed. It's during system start-up, so basically the whole system boot waits for it. It means an additional 1 hour of downtime. This is something really unexpected for me, and unfortunately no one was really interested in my report - I know people are busy. But still, if it hits other users once zfs pools are already populated, people won't be happy. For more details see my post here with subject: zfs mount stuck in zil_replay.

That problem must have fallen through the cracks. Yes we are busy, but we really do care about your experiences and bugs. I have just raised a bug to cover this issue: 6460107 Extremely slow mounts after panic - searching space maps during replay Thanks for reporting this and helping make ZFS better. Neil
Re: [zfs-discuss] Re: Bizzare problem with ZFS filesystem
It is highly likely you are seeing a duplicate of: 6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS which was fixed recently in build 48 of Nevada. The symptoms are very similar. That is, prior to the bug being fixed, an fsync from vi would have to force out all other data through the intent log. Neil. Anantha N. Srirama wrote On 09/13/06 15:58,: One more piece of information. I was able to ascertain that the slowdown happens only when ZFS is used heavily; meaning lots of in-flight I/O. This morning when the system was quiet my writes to the /u099 filesystem were excellent, and they have since gone south like I reported earlier. I am currently awaiting the completion of a write to /u099, well over 60 seconds. At the same time I was able to create/save files in /u001 without any problems. The only difference between /u001 and /u099 is the size of the filesystem (256GB vs 768GB). Per your suggestion I ran a 'zfs set' command and it completed after a wait of around 20 seconds while my file save from vi against /u099 is still pending!!! This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Importing ZFS filesystems across architectures...
Philip Brown wrote On 09/21/06 20:28,: Eric Schrock wrote: If you're using EFI labels, yes (VTOC labels are not endian neutral). ZFS will automatically convert endianness from the on-disk format, and new data will be written using the native endianness, so data will gradually be rewritten to avoid the byteswap overhead. now, when you say data, you just mean metadata, right? Yes. ZFS has no knowledge of the layout of any structured records written by applications, so it can't byteswap user data. Neil ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] panic string assistance
ZFS will currently panic on a write failure to a non replicated pool. In the case below the Intent Log (though it could have been any module) could not write an intent log block. Here's a previous response from Eric Schrock explaining how ZFS intends to handle this: Yes, there are three incremental fixes that we plan in this area: 6417772 need nicer message on write failure This just cleans up the failure mode so that we get a nice FMA failure message and can distinguish this from a random failed assert. 6417779 ZFS: I/O failure (write on ...) -- need to reallocate writes In a multi-vdev pool, this would take a failed write and attempt to do the write on another toplevel vdev. This would all but eliminate the problem for multi-vdev pools. 6322646 ZFS should gracefully handle all devices failing (when writing) This is the real fix. Unfortunately, it's also really hard. Even if we manage to abort the current transaction group, dealing with the semantics of a filesystem which has lost an arbitrary amount of change and notifying the user in a meaningful way is difficult at best. Hope that helps. - Eric Frank Leers wrote On 10/03/06 15:10,: Could someone offer insight into this panic, please? 
panic string: ZFS: I/O failure (write on unknown off 0: zio 6000c5fbc00 [L0 ZIL intent log] 1000L/1000P DVA[0]=1:249b68000:1000 zilog uncompressed BE contiguous birth=318892 fill=0 cksum=3b8f19730caa4327:9e102
panic kernel thread: 0x2a1015d7cc0  PID: 0  on CPU: 530  cmd: sched
t_procp: 0x187c780 (proc_sched)  p_as: 0x187e4d0 (kas)  zone: global
t_stk: 0x2a1015d7ad0  sp: 0x18aa901  t_stkbase: 0x2a1015d2000
t_pri: 99 (SYS)  pctcpu: 0.00
t_lwp: 0x0  psrset: 0  last CPU: 530
idle: 0 ticks (0 seconds)
start: Wed Sep 20 18:17:22 2006
age: 1788 seconds (29 minutes 48 seconds)
tstate: TS_ONPROC - thread is being run on a processor
tflg: T_TALLOCSTK - thread structure allocated from stk
      T_PANIC - thread initiated a system panic
tpflg: none set
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
        TS_SIGNALLED - thread was awakened by cv_signal()
pflag: SSYS - system resident process
pc: 0x105f7f8 unix:panicsys+0x48: call unix:setjmp
startpc: 0x119fa64 genunix:taskq_thread+0x0: save %sp, -0xd0, %sp
unix:panicsys+0x48(0x7b6e53a0, 0x2a1015d77c8, 0x18ab2d0, 0x1, , , 0x4480001601, , , , , , , , 0x7b6e53a0, 0x2a1015d77c8)
unix:vpanic_common+0x78(0x7b6e53a0, 0x2a1015d77c8, 0x7b6e3bf8, 0x7080bc30, 0x7080bc70, 0x7080b840)
unix:panic+0x1c(0x7b6e53a0, 0x7080bbf0, 0x7080bbc0, 0x7b6e4428, 0x0, 0x6000c5fbc00, , 0x5)
zfs:zio_done+0x284(0x6000c5fbc00)
zfs:zio_next_stage(0x6000c5fbc00) - frame recycled
zfs:zio_vdev_io_assess+0x178(0x6000c5fbc00, 0x6000c586da0, 0x7b6c79f0)
genunix:taskq_thread+0x1a4(0x6000bc5ea38, 0x0)
unix:thread_start+0x4()
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] fsflush and zfs
ZFS ignores the fsflush. Here's a snippet of the code in zfs_sync():

	/*
	 * SYNC_ATTR is used by fsflush() to force old filesystems like UFS
	 * to sync metadata, which they would otherwise cache indefinitely.
	 * Semantically, the only requirement is that the sync be initiated.
	 * The DMU syncs out txgs frequently, so there's nothing to do.
	 */
	if (flag & SYNC_ATTR)
		return (0);

However, for a user initiated sync(1M) and sync(2) ZFS does force all outstanding data/transactions synchronously to disk. This goes beyond the requirement of sync(2) which says IO is initiated but not waited on (i.e. asynchronous). Neil. ttoulliu2002 wrote On 10/13/06 00:06,: Is there any change regarding fsflush such as the autoup tunable for zfs? Thanks This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshots impact on performance
Matthew Ahrens wrote On 10/16/06 09:07,: Robert Milkowski wrote: Hello zfs-discuss, S10U2+patches. ZFS pool of about 2TB in size. Each day a snapshot is created and 7 copies are kept. There's a quota set for a file system, however there's always at least 50GB of free space in the file system (and much more in the pool). The ZFS file system is exported over NFS. Snapshots consume about 280GB of space. We have noticed some performance problems on nfs clients of this file system, even during times with smaller load. Raising the quota didn't help. However removing the oldest snapshot automatically solved the performance problems. I do not have more details - sorry. Is it expected for snapshots to have a very noticeable performance impact on the file system being snapshotted? No, this behavior is unexpected. The only way that snapshots should have a performance impact on access to the filesystem is if you are running low on space in the pool or quota (which it sounds like you are not). Can you describe what the performance problems were? What was the workload like? What problem did you identify? How did it improve when you 'zfs destroy'-ed the oldest snapshot? Are you sure that the oldest snapshot wasn't pushing you close to your quota? --matt I could well believe there would be a hiccup on the rest of the pool when the snapshot is taken. Each snapshot calls txg_wait_synced four times: a few related to the zil and one from dsl_sync_task_group_wait. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Porting ZFS file system to FreeBSD.
Pawel, I second that praise. Well done! Attached is a copy of ziltest. You will have to adapt this a bit to your environment. In particular it uses bringover to pull a subtree of our source and then builds and later runs it. This tends to create a fair number of transactions with various dependencies. You'll obviously have to update the paths and tools. However, at least initially, I'd recommend you simplify things by perhaps having the only test be the creation of a file. The basic flow behind ziltest is:

1. Create an empty file system FS1
2. Freeze FS1
3. Perform various user commands that create files, directories, etc
4. Copy FS1 to FS2
5. Unmount and unfreeze FS1
6. Remount FS1 (resulting in replay of log)
7. Compare FS1 with FS2 and complain if not equal

Hope this helps and good luck, Neil. Eric Schrock wrote On 10/27/06 10:18,: Congrats, Pawel. This is truly an impressive piece of work. As you're probably aware, Noel integrated the patches you provided us into build 51. Hopefully that got rid of some spurious differences between the code bases. We do have a program called 'ziltest' that Neil can probably provide for you that does a good job stressing the ZIL. We also have a complete test suite (functional and stress), but it would be non-trivial to port, and I don't know what the current status is for open sourcing the test suites in general. Let us know if there's anything else we can help with. - Eric On Fri, Oct 27, 2006 at 05:41:49AM +0200, Pawel Jakub Dawidek wrote: Here is another update: After way too much time spent on fighting the buffer cache I finally made mmap(2)ed reads/writes work and (which is also very important) kept regular reads/writes working. Now I'm able to build FreeBSD's kernel and userland with both sources and objects placed on a ZFS file system. I also tried to crash it with fsx, fsstress and postmark, but no luck, it works stably. 
On the other hand I'm quite sure there are many problems in ZPL still, but fixing mmap(2) allows me to move forward. As a side note - ZVOL seems to be fully functional. I need to find a way to test the ZIL, so if you guys at Sun have some ZIL tests (like an uncleanly stopped file system which at mount time will exercise the entire ZIL functionality, where we can verify that my FS was fixed properly) that would be great. PS. There is still a lot to do, so please, don't ask me for patches yet. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

#!/bin/ksh -x
#
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License, Version 1.0 only
# (the "License"). You may not use this file except in compliance
# with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets [] replaced with your own identifying
# information: Portions Copyright [] [name of copyright owner]
#
# CDDL HEADER END
#
# Copyright 2006 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
# ident "@(#)ziltest 1.2 06/01/30 SMI"
#
# - creates a 150MB pool in /tmp
# - Should take about a minute (depends on access to the gate for bringover).
# - You can change the gate to local by setting and exporting ZILTEST_GATE
#

PATH=/usr/bin
PATH=$PATH:/usr/sbin
PATH=$PATH:/usr/ccs/bin
#PATH=$PATH:/net/slug.eng/opt/export/`uname -p`/opt/SUNWspro/SOS8/bin
#PATH=$PATH:/net/anthrax.central/export/tools/onnv-tools/SUNWspro/SOS8/bin
PATH=$PATH:/net/haulass.central/export/tools/onnv-tools/SUNWspro/SOS8/bin
#PATH=$PATH:/net/slug.eng/opt/onbld/bin
PATH=$PATH:/opt/onbld/bin
export PATH

#
# SETUP
#
ZILTEST_GATE=${ZILTEST_GATE-/net/haulass.central/export/clones/onnv}
CMD=`basename $0`
POOL=ziltestpool.$$
DEVSIZE=${DEVSIZE-150m}
POOLDIR=/tmp
POOLFILE=$POOLDIR/ziltest_poolfile.$$
FS=$POOL/fs
ROOT=/$FS
COPY=/tmp/${POOL}
KEEP=no

cleanup()
{
	zfs destroy $FS
	zpool iostat $POOL
	print
	zpool status $POOL
	zpool destroy $POOL
	rm -rf $COPY
	rm $POOLFILE
}

bail()
{
	test $KEEP = no && cleanup
	print "$1"
	exit 1
}

test $# -eq 0 || bail "usage: $CMD"

mkfile $DEVSIZE $POOLFILE || bail can't make
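The seven-step flow Neil outlines can be sketched with ordinary zfs/zpool commands. Note the assumptions here: the pool and file system names are placeholders, and the 'zfs freeze' step relies on the hidden debugging facility ziltest uses internally, which is not a documented interface and may not exist in a given build:

```shell
# Rough sketch of the ziltest flow. 'zfs freeze' is assumed to be the
# undocumented debug subcommand that stops txg commits for the file system,
# so that all subsequent changes exist only in the intent log.
zpool create ziltestpool /tmp/ziltest_poolfile
zfs create ziltestpool/fs                 # 1. create an empty file system FS1
zfs freeze ziltestpool/fs                 # 2. freeze FS1
touch /ziltestpool/fs/a                   # 3. user commands creating files,
mkdir /ziltestpool/fs/d                   #    directories, etc.
mkdir /tmp/fs2 && (cd /ziltestpool/fs && tar cf - .) | (cd /tmp/fs2 && tar xf -)   # 4. copy FS1 to FS2
zfs unmount ziltestpool/fs                # 5. unmount (and unfreeze) FS1
zfs mount ziltestpool/fs                  # 6. remount: the intent log is replayed
diff -r /ziltestpool/fs /tmp/fs2          # 7. compare FS1 with FS2; differences mean a replay bug
```

The point of the freeze is that nothing reaches the main pool through normal txg commits, so after the remount every file created in step 3 must have been reconstructed purely from ZIL replay.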
Re: [zfs-discuss] Re: Re: ZFS hangs systems during copy
Jürgen Keil wrote On 10/27/06 11:55,: This is: 6483887 without direct management, arc ghost lists can run amok That seems to be a new bug? http://bugs.opensolaris.org does not yet find it. It's not so new as it was created on 10/19, but as you say bug search doesn't find it. However, you can access it directly: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6483887 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] linux versus sol10
Robert Milkowski wrote On 11/08/06 08:16,: Hello Paul, Wednesday, November 8, 2006, 3:23:35 PM, you wrote: PvdZ On 7 Nov 2006, at 21:02, Michael Schuster wrote: listman wrote: hi, i found a comment comparing linux and solaris but wasn't sure which version of solaris was being referred to. can the list confirm that this issue isn't a problem with solaris10/zfs?? Linux also supports asynchronous directory updates which can make a significant performance improvement when branching. On Solaris machines, inode creation is very slow and can result in very long iowait states. I think this cannot be commented on in a useful fashion without more information about this supposed issue. AFAIK, neither ufs nor zfs create inodes (at run time), so this is somewhat hard to put into context. Get a complete description of what this is about, then maybe we can give you a useful answer. PvdZ This could be related to Linux trading reliability for speed by doing PvdZ async metadata updates. PvdZ If your system crashes before your metadata is flushed to disk your PvdZ filesystem might be hosed and a restore PvdZ from backups may be needed. you can achieve something similar with fastfs on ufs file systems and setting zil_disable to 1 on ZFS. There's a difference for both of these. UFS now has logging (journalling) as the default, and so any crashes/power failures will keep the integrity of the metadata intact (i.e. no fsck/restore). ZFS has no problem either, as it fully transacts both data and metadata and should never see corruption with the intent log disabled or enabled. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48
Tomas Ögren wrote On 11/09/06 09:59,: 1. DNLC-through-ZFS doesn't seem to listen to ncsize. The filesystem currently has ~550k inodes and large portions of it are frequently looked over with rsync (over nfs). mdb said ncsize was about 68k and vmstat -s said we had a hit rate of ~30%, so I set ncsize to 600k and rebooted.. Didn't seem to change much, still seeing hit rates at about the same level, and a manual find(1) doesn't seem to be that cached (according to vmstat and dnlcsnoop.d). When booting, the following message came up, not sure if it matters or not: NOTICE: setting nrnode to max value of 351642 NOTICE: setting nrnode to max value of 235577 Is there a separate ZFS-DNLC knob to adjust for this? Wild guess is that it has its own implementation which is integrated with the rest of the ZFS cache which throws out metadata cache in favour of data cache.. or something.. A more complete and useful set of dnlc statistics can be obtained via kstat -n dnlcstats. As well as the soft limit on dnlc entries (ncsize), the current number of cached entries is also useful: echo ncsize/D | mdb -k echo dnlc_nentries/D | mdb -k nfs does have a maximum number of rnodes which is calculated from the memory available. It doesn't look like nrnode_max can be overridden. Having said that, I actually think your problem is lack of memory. For each ZFS vnode held by the DNLC it uses a *lot* more memory than say UFS. Consequently it has to purge dnlc entries and I suspect with only 1GB that the ZFS ARC doesn't allow many dnlc entries. I don't know if that number is maintained anywhere, for you to check. Mark? Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Some performance questions with ZFS/NFS/DNLC at snv_48
Tomas Ögren wrote On 11/09/06 13:47,: On 09 November, 2006 - Neil Perrin sent me these 1,6K bytes: Tomas Ögren wrote On 11/09/06 09:59,: 1. DNLC-through-ZFS doesn't seem to listen to ncsize. The filesystem currently has ~550k inodes and large portions of it is frequently looked over with rsync (over nfs). mdb said ncsize was about 68k and vmstat -s said we had a hitrate of ~30%, so I set ncsize to 600k and rebooted.. Didn't seem to change much, still seeing hitrates at about the same and manual find(1) doesn't seem to be that cached (according to vmstat and dnlcsnoop.d). When booting, the following message came up, not sure if it matters or not: NOTICE: setting nrnode to max value of 351642 NOTICE: setting nrnode to max value of 235577 Is there a separate ZFS-DNLC knob to adjust for this? Wild guess is that it has its own implementation which is integrated with the rest of the ZFS cache which throws out metadata cache in favour of data cache.. or something.. A more complete and useful set of dnlc statistic can be obtained via kstat -n dnlcstats. As well as soft the limit on dnlc entries (ncsize) the current number of cached entries is also useful: This is after ~28h uptime:

module: unix    instance: 0
name: dnlcstats    class: misc
        crtime                          47.5600948
        dir_add_abort                   0
        dir_add_max                     0
        dir_add_no_memory               0
        dir_cached_current              4
        dir_cached_total                107
        dir_entries_cached_current      4321
        dir_fini_purge                  0
        dir_hits                        11000
        dir_misses                      172814
        dir_reclaim_any                 25
        dir_reclaim_last                16
        dir_remove_entry_fail           0
        dir_remove_space_fail           0
        dir_start_no_memory             0
        dir_update_fail                 0
        double_enters                   234918
        enters                          59193543
        hits                            36690843
        misses                          59384436
        negative_cache_hits             1366345
        pick_free                       0
        pick_heuristic                  57069023
        pick_last                       2035111
        purge_all                       1
        purge_fs1                       0
        purge_total_entries             3748
        purge_vfs                       187
        purge_vp                        95
        snaptime                        99177.711093

vmstat -s: 96080561 total name lookups (cache hits 38%)

echo ncsize/D | mdb -k
echo dnlc_nentries/D | mdb -k
ncsize: 60
dnlc_nentries: 19230

Not quite the same.. 
Having said that I actually think your problem is lack of memory. For each ZFS vnode held by the DNLC it uses a *lot* more memory than say UFS. Consequently it has to purge dnlc entries and I suspect with only 1GB that the ZFS ARC doesn't allow many dnlc entries. I don't know if that number is maintained anywhere, for you to check. Mark? Current memory usage (for some values of usage ;): # echo ::memstat|mdb -k

Page Summary       Pages     MB   %Tot
Kernel             95584    746    75%
Anon               20868    163    16%
Exec and libs       1703     13     1%
Page cache          1007      7     1%
Free (cachelist)      97      0     0%
Free (freelist)     7745     60     6%
Total             127004    992
Physical          125192    978

/Tomas This memory usage shows nearly all of memory consumed by the kernel and probably by ZFS. ZFS can't add any more DNLC entries due to lack of memory without purging others. This can be seen from the number of dnlc_nentries being way less than ncsize. I don't know if there's a DMU or ARC bug to reduce the memory footprint of their internal structures for situations like this, but we are aware of the issue. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
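As an aside, the hit rate vmstat -s reports can be recomputed from the raw dnlcstats counters, which makes it easy to confirm the two agree. A small portable sketch, with the hits and misses values hard-coded from Tomas's output:

```shell
# Recompute the DNLC hit rate from the 'hits' and 'misses' kstat counters.
# 36690843 hits out of (36690843 + 59384436) lookups is about 38%,
# matching the "cache hits 38%" line from vmstat -s.
hits=36690843
misses=59384436
awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.0f%%\n", 100 * h / (h + m) }'
```

On a live system the two numbers would come from kstat -p unix:0:dnlcstats:hits and unix:0:dnlcstats:misses instead of being hard-coded.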
Re: [zfs-discuss] zfs mount stuck in zil_replay
Hi Robert, Yes, it could be related, or even the bug. Certainly the replay was (prior to this bug fix) extremely slow. I don't really have enough information to determine if it's the exact problem, though after re-reading your original post I strongly suspect it is. I also putback a companion fix which should be helpful in determining if zil_replay is making progress, or hung: 6486496 zil_replay() useful debug Neil. Robert Milkowski wrote On 11/09/06 18:10,: Hello Neil, I can see http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6478388 integrated. I guess it could be related to problem I described here, right? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Managed to corrupt my pool
Jim, I'm not at all sure what happened to your pool. However, I can answer some of your questions. Jim Hranicky wrote On 12/05/06 11:32,: So the questions are: - is this fixable? I don't see an inum I could run find on to remove. I think the pool is busted. Even the message printed in your previous email is bad: DATASET OBJECT RANGE 15 0 lvl=4294967295 blkid=0 as the level is way out of range. and I can't even do a zfs volinit anyway: nextest-01# zfs volinit cannot iterate filesystems: I/O error I'm not sure why you're using zfs volinit, which I believe creates the zvol links, but this further shows problems. - would not enabling zil_disable have prevented this? No, the intent log is not needed for pool integrity. It ensures the synchronous semantics of O_DSYNC/fsync are obeyed. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Ben, The attached dscript might help in determining the zfs_create issue. It prints: - a count of all functions called from zfs_create - average wall clock time of the 30 highest functions - average cpu time of the 30 highest functions Note, please ignore warnings of the following type: dtrace: 1346 dynamic variable drops with non-empty dirty list Neil. Ben Rockwood wrote On 12/07/06 06:01,: I've got a Thumper doing nothing but serving NFS. Its using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
 21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
 20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
 25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
 37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
 16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
 17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
 27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP shared via ZFS sharenfs on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome! benr. 
=== Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed dns was specified in /etc/nsswitch.conf which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF. Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves:

[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc    12/07/2006
10:38:05  %usr  %sys  %wio  %idle
10:38:06     0    27     0     73
10:38:07     0    27     0     73
10:38:09     0    27     0     73
10:38:10     1    26     0     73
10:38:11     0    26     0     74
10:38:12     0    26     0     74
10:38:13     0    24     0     76
10:38:14     0     6     0     94
10:38:15     0     7     0     93
10:38:22     0    99     0      1  --
10:38:23     0    94     0      6  --
10:38:24     0    28     0     72
10:38:25     0    27     0     73
10:38:26     0    27     0     73
10:38:27     0    27     0     73
10:38:28     0    27     0     73
10:38:29     1    30     0     69
10:38:30     0    27     0     73

And so we consider whether or not there is a pattern to the frequency. The following is sar output from any lines in which sys is above 90%:

10:40:04  %usr  %sys  %wio  %idle   Delta
10:40:11     0    97     0      3
10:40:45     0    98     0      2   34 seconds
10:41:02     0    94     0      6   17 seconds
10:41:26     0   100     0      0   24 seconds
10:42:00     0   100     0      0   34 seconds
10:42:25  (end of sample)           25 seconds

Looking at the congestion in the run queue:

[private:/tmp] root# sar -q 5 100
10:45:43  runq-sz  %runocc  swpq-sz  %swpocc
10:45:51     27.0       85      0.0        0
10:45:57      1.0       20      0.0        0
10:46:02      2.0       60      0.0        0
10:46:13     19.8       99      0.0        0
10:46:23     17.7       99      0.0        0
10:46:34     24.4       99      0.0        0
10:46:41     22.1       97      0.0        0
10:46:48     13.0       96      0.0        0
10:46:55     25.3      102      0.0        0

Looking at the per-CPU breakdown:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
 00 00 324 224000 1540 00 100 0 0
 10 00 1140 2260 10 130860 1 0 99
 20 00 162 138 1490540 00
Re: [zfs-discuss] Re: ZFS Storage Pool advice
Are you looking purely for performance, or for the added reliability that ZFS can give you? If the latter, then you would want to configure across multiple LUNs in either a mirrored or RAID configuration. This does require sacrificing some storage in exchange for the peace of mind that any “silent data corruption” in the array or storage fabric will be not only detected but repaired by ZFS. From a performance point of view, what will work best depends greatly on your application I/O pattern, how you would map the application’s data to the available ZFS pools if you had more than one, how many channels are used to attach the disk array, etc. A single pool can be a good choice from an ease-of-use perspective, but multiple pools may perform better under certain types of load (for instance, there’s one intent log per pool, so if the intent log writes become a bottleneck then multiple pools can help). Bad example, as there's actually one intent log per file system! This also depends on how the LUNs are configured within the EMC array. If you can put together a test system, and run your application as a benchmark, you can get an answer. Without that, I don’t think anyone can predict which will work best in your particular situation. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Monitoring ZFS
Tom Duell wrote On 12/12/06 17:11,: Group, We are running a benchmark with 4000 users simulating a hospital management system running on Solaris 10 6/06 on USIV+ based SunFire 6900 with 6540 storage array. Are there any tools for measuring internal ZFS activity to help us understand what is going on during slowdowns? dtrace can be used in numerous ways to examine every part of ZFS and Solaris. lockstat(1M) (which actually uses dtrace underneath) can also be used to see the cpu activity (try lockstat -kgIW -D 20 sleep 10). You can also use iostat (eg iostat -xnpcz) to look at disk activity. We have 192GB of RAM and while ZFS runs well most of the time, there are times where the system time jumps up to 25-40% as measured by vmstat and iostat. These times coincide with slowdowns in file access as measured by a side program that simply reads a random block in a file... these response times can exceed 1 second or longer. ZFS commits transaction groups every 5 seconds. I suspect this flurry of activity is due to that. Committing can indeed take longer than a second. You might be able to show this by changing it with: # echo txg_time/W 10 | mdb -kw then the activity should be longer but less frequent. I don't however recommend you keep it at that value. Any pointers greatly appreciated! Tom ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
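The three tools Neil mentions can be wrapped into one capture script to run the moment a slowdown hits, so all three views cover the same interval. A minimal sketch (the output paths and durations are arbitrary choices, not from the thread):

```shell
#!/bin/ksh
# Capture CPU, kernel-profile, and disk activity for one 10-second window
# during a suspected txg-commit stall. Run as root on Solaris.
ts=$(date '+%Y%m%d-%H%M%S')
lockstat -kgIW -D 20 sleep 10  > /var/tmp/lockstat.$ts 2>&1 &   # top kernel callers by CPU
iostat -xnpcz 1 10             > /var/tmp/iostat.$ts   2>&1 &   # per-device service times
vmstat 1 10                    > /var/tmp/vmstat.$ts   2>&1 &   # usr/sys split over the window
wait
echo "samples in /var/tmp/*.$ts"
```

If the system-time spikes in the vmstat file line up with bursts of disk writes in the iostat file roughly every 5 seconds, that supports the txg-commit explanation.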
Re: [zfs-discuss] Some ZFS questions
CT Will I be able to tune the DMU flush rate, now set at 5 seconds? echo 'txg_time/D 0t1' | mdb -kw Er, that 'D' should be a 'W'. Having said that I don't think we recommend messing with the transaction group commit timing. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
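One detail worth spelling out: mdb's default input radix is hexadecimal, so an explicit 0t prefix makes the decimal intent unambiguous. A sketch of both directions (and, as above, changing txg_time is not recommended):

```shell
# Read the current txg commit interval; /D displays as signed decimal.
echo 'txg_time/D' | mdb -k

# Write a new value: '0t10' is decimal ten. A bare '10' would be parsed
# in mdb's default radix (hex) and would write 16 instead.
echo 'txg_time/W 0t10' | mdb -kw
```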
Re: [zfs-discuss] Difference between ZFS and UFS with one LUN from a SAN
Robert Milkowski wrote On 12/22/06 13:40,: Hello Torrey, Friday, December 22, 2006, 9:17:46 PM, you wrote: TM Roch - PAE wrote: The fact that most FS do not manage the disk write caches does mean you're at risk of data loss for those FS. TM Does ZFS? I thought it just turned it on in the places where we had TM previously turned it off. ZFS sends a flush cache command after each transaction group so it's sure the transaction is on stable storage. ... and after every fsync, O_DSYNC, etc. that writes out intent log blocks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solid State Drives?
I'm currently working on putting the ZFS intent log on separate devices which could include separate disks and nvram/solid state devices. This would help any application using fsync/O_DSYNC - in particular DB and NFS. From prototyping, considerable performance improvements have been seen. Neil. Kyle McDonald wrote On 01/05/07 08:10,: I know there's been much discussion on the list lately about getting HW arrays to use (or not use) their caches in a way that helps ZFS the most. Just yesterday I started seeing articles on NAND Flash Drives, and I know other Solid State Drive technologies have been around for a while and many times are used for transaction logs or other ways of accelerating FS's. If these devices become more prevalent, and/or cheaper, I'm curious what ways ZFS could be made to best take advantage of them? One idea I had was for each pool to allow me to designate a mirror or RaidZ of these devices just for the transaction logs. Since they're faster than normal disks, my uneducated guess is that they could boost performance. I suppose it doesn't eliminate the problems with the real drive (or array) caches though. You still need to know that the data is on the real drives before you can wipe that transaction from the transaction log right? Well... I'd still like to hear the experts' ideas on how this could (or won't ever?) help ZFS out? Would changes to ZFS be required? -Kyle ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
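For what it's worth, the separate-intent-log work Neil describes here later shipped as a dedicated "log" vdev type. A sketch of that eventual syntax, with hypothetical pool and device names:

```shell
# Add a solid-state device as a dedicated intent log ("slog") for the pool,
# so synchronous fsync/O_DSYNC writes land on the fast device instead of
# the main pool disks.
zpool add tank log c4t0d0

# Or mirror the log device, since losing an unmirrored slog can lose the
# most recent synchronous writes:
zpool add tank log mirror c4t0d0 c5t0d0
```

This also answers Kyle's question directly: the main pool still receives the data at the next transaction group commit, and only then can the corresponding log blocks be freed.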
Re: [zfs-discuss] Solid State Drives?
Robert Milkowski wrote On 01/05/07 11:45,: Hello Neil, Friday, January 5, 2007, 4:36:05 PM, you wrote: NP I'm currently working on putting the ZFS intent log on separate devices NP which could include separate disks and nvram/solid state devices. NP This would help any application using fsync/O_DSYNC - in particular NP DB and NFS. From prototyping considerable performance improvements have NP been seen. Can you share any results from prototype testing? I'd prefer not to just yet as I don't want to raise expectations unduly. When testing I was using a simple local benchmark, whereas I'd prefer to run something more official such as TPC. I'm also missing a few required features in the prototype which may affect performance. Hopefully I can provide some results soon, but even those will be unofficial. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Puzzling ZFS behavior with COMPRESS option
Anantha N. Srirama wrote On 01/08/07 13:04,: Our setup: - E2900 (24 x 96); Solaris 10 Update 2 (aka 06/06) - 2 2Gbps FC HBA - EMC DMX storage - 50 x 64GB LUNs configured in 1 ZFS pool - Many filesystems created with COMPRESS enabled; specifically I've one that is 768GB I'm observing the following puzzling behavior: - We are currently creating a large (1.4TB) and sparse dataset; most of the dataset contains repeating blanks (default/standard SAS dataset behavior.) - ls -l reports the file size as 1.4+TB and du -sk reports the actual on disk usage at around 65GB. - My I/O on the system is pegged at 150+MB/S as reported by zpool iostat and I've confirmed the same with iostat. This is very confusing - ZFS is doing very good compression as reported by the ratio of on disk versus as reported size of the file (1.4TB vs 65GB) - Why on God's green earth am I observing such high I/O when indeed ZFS is compressing? I can't believe that the program is actually generating I/O at the rate of (150MB/S * compressratio). Any thoughts? One possibility is that the data is written synchronously (uses O_DSYNC, fsync, etc), and so the ZFS Intent Log (ZIL) will write that uncompressed data to stable storage in case of a crash/power fail before the txg is committed. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
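One way to check whether synchronous (ZIL) traffic explains the extra I/O is to compare the achieved compression ratio against what the pool is actually writing. A sketch (the dataset name is a placeholder):

```shell
# Compression actually achieved on the dataset in question.
zfs get compressratio tank/sas

# Watch per-device write bandwidth; with synchronous writers, intent-log
# blocks are written uncompressed ahead of the compressed txg copy, which
# can inflate observed write bandwidth well beyond the compressed size.
zpool iostat -v tank 5
```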
Re: [zfs-discuss] Re: Re: Heavy writes freezing system
Rainer Heilke wrote On 01/17/07 15:44,: It turns out we're probably going to go the UFS/ZFS route, with 4 filesystems (the DB files on UFS with Directio). It seems that the pain of moving from a single-node ASM to a RAC'd ASM is great, and not worth it. The DBA group decided doing the migration to UFS for the DB files now, and then to a RAC'd ASM later, will end up being the easiest, safest route. Rainer Still curious as to if and when this bug will get fixed... If you're referring to bug 6413510 that Anantha mentioned then my earlier post today answered that: This problem was fixed in snv_48 last September and will be in S10_U4. Neil ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Heavy writes freezing system
Anton B. Rang wrote On 01/17/07 20:31,: Yes, Anantha is correct that is the bug id, which could be responsible for more disk writes than expected. I believe, though, that this would explain at most a factor of 2 of write expansion (user data getting pushed to disk once in the intent log, then again in its final location). Agreed. If the writes are relatively large, there'd be even less expansion, because the ZIL will write a large enough block of data (would this be 128K?) Anything over zfs_immediate_write_sz (currently 32KB) is written in this way. into a block which can be used as its final location. (If I'm understanding some earlier conversations right; haven't looked at the code lately.) Anton This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] bug id 6381203
Hi Leon, This was fixed in March 2006, and is in S10_U2. Neil. Leon Koll wrote On 01/28/07 08:58,: Hello, what is the status of the bug 6381203 fix in S10 u3? (deadlock due to i/o while assigning (tc_lock held)) Was it integrated? Is there a patch? Thanks, -- leon This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS inode equivalent
No it's not the final version or even the latest! The current on disk format version is 3. However, it hasn't diverged much and the znode/acl stuff hasn't changed. Neil. James Blackburn wrote On 01/31/07 14:31,: Or look at pages 46-50 of the ZFS on-disk format document: http://opensolaris.org/os/community/zfs/docs/ondiskformatfinal.pdf There's a final version? That link appears to be broken (and the latest version linked from the ZFS docs area http://opensolaris.org/os/community/zfs/docs/ is dated 0822). James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS checksums - block or file level
ZFS checksums are at the block level. Nathan Essex wrote On 02/01/07 08:27,: I am trying to understand if zfs checksums apply at a file or a block level. We know that zfs provides end to end checksum integrity, and I assumed that when I write a file to a zfs filesystem, the checksum was calculated at a file level, as opposed to say, a block level. However, I have noticed that when I create an emulated volume, that volume has a checksum property, set to the same default as a normal zfs filesystem. I can even change the checksum value as normal, see below: # /usr/sbin/zfs create -V 50GB -b 128KB mypool/myvol # /usr/sbin/zfs set checksum=sha256 mypool/myvol Now on this emulated volume, I could place any number of structures that are not zfs filesystems, say raw database volumes, or ufs, qfs, etc. Since these do not perform end to end checksums, can someone explain to me what the zfs checksum would be doing at this point? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
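Because checksums are computed per ZFS block, they protect whatever sits on an emulated volume (UFS, QFS, raw database files) even though that consumer does no end-to-end checking of its own. A sketch, reusing the poster's volume (pool/volume names are placeholders):

```shell
# Checksums are per ZFS block, so every block of the zvol is protected
# end to end regardless of what is layered on top of it.
zfs create -V 50G -b 128K mypool/myvol
zfs set checksum=sha256 mypool/myvol

# A scrub verifies every block checksum in the pool, zvol blocks included.
zpool scrub mypool
zpool status -v mypool
```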
Re: [storage-discuss] Re[2]: [zfs-discuss] se3510 and ZFS
Robert Milkowski wrote On 02/06/07 11:43,: Hello eric, Tuesday, February 6, 2007, 5:55:23 PM, you wrote: IIRC Bill posted here some time ago saying the problem with write cache on the arrays is being worked on. ek Yep, the bug is: ek 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE ek CACHE to ek SBC-2 devices Thanks. I see a workaround there (I saw it earlier but it doesn't apply to 3510) and I have a question - setting zil_disable to 1 won't actually completely disable cache flushing, right? (the cache would still be flushed every time a txg completes)? ek We have a case going through PSARC that will make things work ek correctly with regards to flushing the write cache and non-volatile ek caches. There's actually a tunable to disable cache flushes: zfs_nocacheflush and in older code (like S10U3) it's zil_noflush. Yes, but we didn't want to publicise this internal switch. (I would not call it a tunable). We (or at least I) are regretting publicising zil_disable, but using zfs_nocacheflush is worse. If the device is volatile then we can get pool corruption. An uberblock could get written before all of its tree. Note, zfs_nocacheflush and zil_noflush are not the same. Setting zil_noflush stopped zil flushes of the write cache, whereas zfs_nocacheflush will additionally stop flushing for txgs. Hmm... ek The tricky part is getting vendors to actually support SYNC_NV bit. ek If your favorite vendor/array doesn't support it, feel free to ek give them a call... Is there any work being done to ensure/check that all arrays Sun sells do support it? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
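For completeness, these switches are set via /etc/system on Solaris. As Neil stresses, they are internal switches rather than supported tunables, and zfs_nocacheflush is only plausibly safe when every device in the pool has a nonvolatile (battery/NVRAM-backed) cache. A hedged sketch:

```shell
# /etc/system fragment (takes effect after reboot). NOT recommended:
# with a volatile write cache this risks pool corruption.

# Newer code: stop ZFS issuing cache flushes (both ZIL and txg commits):
set zfs:zfs_nocacheflush = 1

# Older code (e.g. S10U3): stop only the ZIL's cache flushes:
set zfs:zil_noflush = 1
```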
Re: [zfs-discuss] Efficiency when reading the same file blocks
Jeff Davis wrote On 02/25/07 20:28,: if you have N processes reading the same file sequentially (where file size is much greater than physical memory) from the same starting position, should I expect that all N processes finish in the same time as if it were a single process? Yes I would expect them to finish the same time. There should be no additional reads because the data will be in the ZFS cache (ARC). Given your question are you about to come back with a case where you are not seeing this? Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Today PANIC :(
Gino, We have seen this before but only very rarely and never got a good crash dump. Coincidentally, we saw it only yesterday on a server here, and are currently investigating it. Did you also get a dump we can access? That would help. If not, can you tell us what zfs version you were running. At the moment I'm not sure how even you can recover from it. Sorry about this problem. FYI this is bug: http://bugs.opensolaris.org/view_bug.do?bug_id=6458218 Neil. Gino Ruopolo wrote On 02/28/07 02:17,: Feb 28 05:47:31 server141 genunix: [ID 403854 kern.notice] assertion failed: ss == NULL, file: ../../common/fs/zfs/space_map.c, line: 81 Feb 28 05:47:31 server141 unix: [ID 10 kern.notice] Feb 28 05:47:31 server141 genunix: [ID 802836 kern.notice] fe8000d559f0 fb9acff3 () Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55a70 zfs:space_map_add+c2 () Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55aa0 zfs:space_map_free+22 () Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55ae0 zfs:space_map_vacate+38 () Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55b40 zfs:zfsctl_ops_root+2fdbc7e7 () Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55b70 zfs:vdev_sync_done+2b () Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55bd0 zfs:spa_sync+215 () Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55c60 zfs:txg_sync_thread+115 () Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55c70 unix:thread_start+8 () Feb 28 05:47:31 server141 unix: [ID 10 kern.notice] Feb 28 05:47:31 server141 genunix: [ID 672855 kern.notice] syncing file systems... Feb 28 05:47:32 server141 genunix: [ID 733762 kern.notice] 1 Feb 28 05:47:33 server141 genunix: [ID 904073 kern.notice] done What happened this time? Any suggestions? 
thanks, gino This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mirror question
Yes, this is supported now. Replacing one half of a mirror with a larger device; letting it resilver; then replacing the other half does indeed get a larger mirror. I believe this is described somewhere but I can't remember where now. Neil. Richard L. Hamilton wrote On 03/23/07 20:45,: If I create a mirror, presumably if possible I use two or more identically sized devices, since it can only be as large as the smallest. However, if later I want to replace a disk with a larger one, and detach the mirror (and anything else on the disk), replace the disk (and if applicable repartition it), since it _is_ a larger disk (and/or the partitions will likely be larger since they mustn't be smaller, and blocks per cylinder will likely differ, and partitions are on cylinder boundaries), once I reattach everything, I'll now have two different sized devices in the mirror. So far, the mirror is still the original size. But what if I later replace the other disks with ones identical to the first one I replaced? With all the devices within the mirror now the larger size, will the mirror and the zpool of which it is a part expand? And if that won't happen automatically, can it (without inordinate trickery, and online, i.e. without backup and restore) be forced to do so? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
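The replace-and-resilver sequence Neil confirms can be sketched like this (device names are placeholders; wait for each resilver to finish before touching the other half):

```shell
# Swap each side of the mirror for a larger disk, one at a time.
zpool replace tank c0t0d0 c2t0d0   # first half, onto a larger disk
zpool status tank                  # wait until the resilver completes
zpool replace tank c0t1d0 c2t1d0   # then the second half
zpool status tank

# Once both sides are the larger size, the mirror (and pool) can grow.
zpool list tank
```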
Re: [zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Matthew Ahrens wrote On 03/24/07 12:13,: Kangurek wrote: Thanks for info. My idea was to traverse the changing filesystem, now I see that it will not work. I will try to traverse snapshots. Zreplicate will: 1. do snapshot @replicate_latest and 2. send data to snapshot @replicate_latest 3. wait X sec ( X = 20 ) 4. remove @replicate_previous, rename @replicate_latest to @replicate_previous 5. repeat from 1. I'm sure it will work, but taking snapshots will be slow on a loaded filesystem. Do you have any idea how to speed up operations on snapshots. 1. remove @replicate_previous 2. rename @replicate_latest to @replicate_previous 3. create @replicate_latest You can avoid the rename by doing: zfs create @A again: zfs destroy @B zfs create @B zfs send @A @B zfs destroy @A zfs create @A zfs send @B @A goto again I'm not sure exactly what will be slow about taking snapshots, but one aspect might be that we have to suspend the intent log (see call to zil_suspend() in dmu_objset_snapshot_one()). I've been meaning to change that for a while now -- just let the snapshot have the (non-empty) zil header in it, but don't use it (eg. if we rollback or clone, explicitly zero out the zil header). So you might want to look into that. I've always thought the slowness was due to the txg_wait_synced(). I just counted 5 for one snapshot: [0] $c zfs`txg_wait_synced+0xc(30005c51dc0, 0, 7aa610d3, 70170800, ...) zfs`zil_commit_writer+0x34c(30010c55200, 151, 151, 1, 3fe, 7aa84600) zfs`zil_commit+0x68(30010c55200, 151, 0, 30010c5527c, 151, 0) zfs`zil_suspend+0xc0(30010c55200, 2a1010db240, 0, 0, 30014b32e00, 0) zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0) zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...) zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...) 
[0] $c zfs`txg_wait_synced+0xc(30005c51dc0, 3, 151, c00431549f, 3fe, 7aa84600) zfs`zil_destroy+0xc(30010c55200, 0, 0, 30010c5527c, 30014b32e00, 0) zfs`zil_suspend+0x108(30010c55200, 2a1010db240, 30010c5527c, 0, 30014b32e00, 0) zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0) zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...) zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400,...) [0] $c zfs`txg_wait_synced+0xc(30005c51dc0, 36f8, 30593b0, 1f8, 1f8, 180c000) zfs`zil_destroy+0x1b0(30010c55200, 0, 701d5760, 30010c5527c, ...) zfs`zil_suspend+0x108(30010c55200, 2a1010db240, 30010c5527c, 0, 30014b32e00, 0) zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0) zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...) zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...) [0] $c zfs`txg_wait_synced+0xc(30005c51dc0, 36f9, 30593b0, 1f8, 1f8, 180c000) zfs`dsl_sync_task_group_wait+0x11c(300109a7ac8, 30005c51dc0, 7aa60700, ...) zfs`dmu_objset_snapshot+0x100(300265bd000, 300265bd400, 0, 0, ...) zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...) [0] $c zfs`txg_wait_synced+0xc(30005c51dc0, 36fa, 30593b0, 1f8, 1f8, 180c000) zfs`dsl_sync_task_group_wait+0x11c(300109a7ac8, 30005c51dc0, ...) zfs`dsl_sync_task_do+0x28(30005c51dc0, 0, 7aa2d898, 300028f7680,...) zfs`spa_history_log+0x30(300028f7680, 3000dee1490, 0, 7aa2d800, 1, 18) zfs`zfs_ioc_pool_log_history+0xd8(7aa64c00, 0, 17, 18, 3000dee1490, 7aa64c00) zfs`zfsdev_ioctl+0x12c(701cf768, 701cf660, ffbfe850, 108, 701cf400,...) --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
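Matt's goto loop above uses shorthand; with real zfs(1M) syntax the rename-free rotation might look roughly like this (pool, filesystem, and the receiving dataset are hypothetical, and backup/fs is assumed to have been seeded with a full send of @A first):

```shell
# Alternate between two snapshot names so no rename is ever needed.
zfs snapshot tank/fs@A
# (seed the receiver once: zfs send tank/fs@A | zfs receive backup/fs)
while :; do
    zfs destroy tank/fs@B 2>/dev/null   # may not exist on the first pass
    zfs snapshot tank/fs@B
    zfs send -i @A tank/fs@B | zfs receive -F backup/fs   # incremental
    zfs destroy tank/fs@A
    zfs snapshot tank/fs@A
    zfs send -i @B tank/fs@A | zfs receive -F backup/fs
    zfs destroy tank/fs@B
done
```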
Re: [zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Matthew Ahrens wrote On 03/24/07 12:36,: Neil Perrin wrote: I'm not sure exactly what will be slow about taking snapshots, but one aspect might be that we have to suspend the intent log (see call to zil_suspend() in dmu_objset_snapshot_one()). I've been meaning to change that for a while now -- just let the snapshot have the (non-empty) zil header in it, but don't use it (eg. if we rollback or clone, explicitly zero out the zil header). So you might want to look into that. I've always thought the slowness was due to the txg_wait_synced(). I just counted 5 for one snapshot: Yeah, well 3 of the 5 are for zil_suspend(), so I think you've proved my point :-) I believe that the one from spa_history_log() will go away with MarkS's delegated admin work, leaving just the one txg_wait_synced() that actually does it. Bottom line, it should be possible to make zfs snapshot take 5x less time, without extraordinary effort. I'm not sure. Doing one will take the same time as more than one (assuming same txg) but at least one is needed to ensure all transactions prior to the snapshot are committed. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Size taken by a zfs symlink
Hi Robert, Robert Milkowski wrote On 04/02/07 17:48,: Right now a symlink should consume one dnode (320 bytes) dnode_phys_t is actually 512 bytes: ::sizeof dnode_phys_t gives sizeof (dnode_phys_t) = 0x200 if the name it points to is less than 67 bytes, otherwise a data block is allocated additionally to the dnode (and more IOs will be needed to read it). And of course an entry in a directory is needed as for a normal file. - Right Cheers: Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync
cedric briner wrote: You might set zil_disable to 1 (_then_ mount the fs to be shared). But you're still exposed to OS crashes; those would still corrupt your nfs clients. -r hello Roch, I've a few questions 1) from: Shenanigans with ZFS flushing and intelligent arrays... http://blogs.digitar.com/jjww/?itemid=44 I read : Disable the ZIL. The ZIL is the way ZFS maintains _consistency_ until it can get the blocks written to their final place on the disk. This is wrong. The on-disk format is always consistent. The author of this blog is misinformed and is probably getting confused with traditional journalling. That's why the ZIL flushes the cache. The ZIL flushes its blocks to ensure that if a power failure/panic occurs then the data the system guarantees to be on stable storage (due, say, to fsync or O_DSYNC) is actually on stable storage. If you don't have the ZIL and a power outage occurs, your blocks may go poof in your server's RAM...'cause they never made it to the disk Kemosabe. True, but not blocks, rather system call transactions - as this is what the ZIL handles. from : Eric Kustarz's Weblog http://blogs.sun.com/erickustarz/entry/zil_disable I read : Note: disabling the ZIL does _NOT_ compromise filesystem integrity. Disabling the ZIL does NOT cause corruption in ZFS. then : I don't understand: In one they tell that: - we can lose _consistency_ and in the other one they say that : - does not compromise filesystem integrity so .. which one is right ? Eric's, who works on ZFS! 2) from : Eric Kustarz's Weblog http://blogs.sun.com/erickustarz/entry/zil_disable I read: Disabling the ZIL is definitely frowned upon and can cause your applications much confusion. Disabling the ZIL can cause corruption for NFS clients in the case where a reply to the client is done before the server crashes, and the server crashes before the data is commited to stable storage. If you can't live with this, then don't turn off the ZIL. 
then: The service that we export with zfs NFS is not such things as databases or some really stressful system, but just exporting home. So it feels to me that we can just disable this ZIL. 3) from: NFS and ZFS, a fine combination http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison I read: NFS service with risk of corruption of client's side view : nfs/ufs : 7 sec (write cache enable) nfs/zfs : 4.2 sec (write cache enable,zil_disable=1) nfs/zfs : 4.7 sec (write cache disable,zil_disable=1) Semantically correct NFS service : nfs/ufs : 17 sec (write cache disable) nfs/zfs : 12 sec (write cache disable,zil_disable=0) nfs/zfs : 7 sec (write cache enable,zil_disable=0) then : Does this mean that when you just create an UFS FS, and that you just export it with NFS, you are running a semantically incorrect NFS service? And that you have to disable the write cache to have a correct NFS server ??? Yes. UFS requires the write cache to be disabled to maintain consistency. 4) so can we say that people used to have an NFS with risk of corruption of client's side view can just take ZFS and disable the ZIL ? I suppose, but we aim to strive for better than expected corruption. We (ZFS) recommend not disabling the ZIL. We also recommend not disabling the disk write cache flushing unless the caches are backed by nvram or UPS. thanks in advance for your clarifications Ced. P.-S. Do any of you know the best way to send an email containing many questions inside it ? Should I create a thread for each of them, the next time This works. - Good questions. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
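For the home-directory case Cedric describes, the recommended setup keeps the ZIL enabled (for correct NFS semantics) and simply shares the filesystem. A sketch (pool name and share options are placeholders):

```shell
# Export home directories with ZFS's built-in NFS sharing; leave the
# ZIL and cache flushing at their defaults so synchronous NFS commits
# really reach stable storage.
zfs create tank/home
zfs set sharenfs=rw tank/home
zfs get sharenfs tank/home
```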
Re: [zfs-discuss] recovered state after system crash
kyusun Chang wrote On 05/04/07 19:34,: If system crashes some time after last commit of transaction group (TxG), what happens to the file system transactions since the last commit of TxG They are lost, unless they were synchronous (see below). (I presume last commit of TxG represents the last on-disk consistency)? Correct. Does ZFS recover all file system transactions which it returned with success since the last commit of TxG, which implies that ZIL must flush log records for each successful file system transaction before it returns to caller so that it can replay the filesystem transactions? Only synchronous transactions (those forced by O_DSYNC or fsync()) are written to the intent log. The blogs on the ZIL state (I hope I read them right) that log records are maintained in-memory and flushed to disk only when 1) at synchronous write request (does that mean they free in-memory log after that), Yes they are then freed in memory 2) when TxG is committed (and free in-memory log). Thank you for your time. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] does every fsync() require O(log n) platter-writes?
Adam Megacz wrote: After reading through the ZFS slides, it appears to be the case that if ZFS wants to modify a single data block, it must rewrite every block between that modified block and the uberblock (root of the tree). Is this really the case? That is true when committing the transaction group to the main pool every 5 seconds. However, this isn't so bad as a lot of transactions are committed which likely have common roots, and writes are aggregated and striped across the pool etc... If so, does this mean that every commit operation (ie every fsync()) in ZFS requires O(log n) platter writes? The ZIL does not modify the main pool. It only writes system call transactions related to the file being fsynced and any other transactions that might be related to that file (eg mkdir, rename). Writes for these transactions are also aggregated and written using a block size tailored to fit the data. Typically for a single system call just one write occurs. On a system crash or power fail those ZIL transactions are replayed. See also: http://blogs.sun.com/perrin Neil. Thanks, - a ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: does every fsync() require O(log n) platter-writes?
Adam Megacz wrote: Ah, okay. The slides I read said that in ZFS there is no journal -- not needed (slide #9): http://www.opensolaris.org/os/community/zfs/docs/zfs_last.pdf I guess the slides are out of date in light of the ZFS Intent Log journal? Yes, I can understand your confusion. Technically the intent log is not a journal. A journal has to be replayed to get metadata consistency of the fs. UFS logging, EXT3 and VXFS all use journals. For perf reasons user data is typically not logged, leading to user data inconsistency. On the other hand, the zfs pool is always consistent whether or not the intent log is replayed. Anyways, it all makes sense now. Without a journal, you'd need to perform the operation on slide #11 for every fsync(), which would be a major performance problem. With a journal, you don't need to do this. Great work, guys... - Thanks Adam. - a ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: How does ZFS write data to disks?
lonny wrote: On May 11, 2007, at 9:09 AM, Bob Netherton wrote: **On Fri, 2007-05-11 at 09:00 -0700, lonny wrote: **I've noticed a similar behavior in my writes. ZFS seems to write in bursts of ** around 5 seconds. I assume it's just something to do with caching? ^Yep - the ZFS equivalent of fsflush. Runs more often so the pipes don't ^get as clogged. We've had lots of rain here recently, so I'm sort of ^sensitive to stories of clogged pipes. ^ **Is this behavior ok? seems it would be better to have the disks writing ** the whole time instead of in bursts. ^ ^Perhaps - although not in all cases (probably not in most cases). ^Wouldn't it be cool to actually do some nice sequential writes to ^the sweet spot of the disk bandwidth curve, but not depend on it ^so much that a single random I/O here and there throws you for ^a loop ? ^ ^Human analogy - it's often more wise to work smarter than harder :-) ^ ^Directly to your question - are you seeing any anomalies in file ^system read or write performance (bandwidth or latency) ? ^Bob No performance problems so far, the thumper and zfs seem to handle everything we throw at them. On the T2000 internal disks we were seeing a bottleneck when using a single disk for our apps but moving to a 3 disk raidz alleviated that. The only issue is when using iostat commands the bursts make it a little harder to gauge performance. Is it safe to assume that if those bursts were to reach the upper performance limit that it would spread the writes out a bit more? The burst of activity every 5 seconds is when the transaction group is committed. Batching up the writes in this way can lead to a number of efficiencies (as Bob hinted). With heavier activity the writes will not get spread out, but will just take longer. Another way to look at the gaps of IO inactivity is that they indicate underutilisation. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
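The txg burst pattern Neil describes is easy to observe with sampling, and a longer interval works around lonny's gauging problem (pool name and intervals are placeholders):

```shell
# 1-second samples show write bursts roughly every 5 seconds as each
# transaction group commits; quiet gaps indicate spare bandwidth.
zpool iostat tank 1

# Sampling over an interval longer than the txg interval averages the
# bursts out, which is easier for gauging sustained throughput.
zpool iostat tank 30
```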
Re: [zfs-discuss] NFS and Tar/Star Performance
eric kustarz wrote: Over NFS to non-ZFS drive - tar xfvj linux-2.6.21.tar.bz2 real 5m0.211s, user 0m45.330s, sys 0m50.118s star xfv linux-2.6.21.tar.bz2 real 3m26.053s, user 0m43.069s, sys 0m33.726s star -no-fsync -x -v -f linux-2.6.21.tar.bz2 real 3m55.522s, user 0m42.749s, sys 0m35.294s It looks like ZFS is the culprit here. The untarring is much faster to a single 80 GB UFS drive than a 6 disk raid-z array over NFS. Comparing a ZFS pool made out of a single disk to a single UFS filesystem would be a fair comparison. Right, and to be fairer you need to ensure the disk write cache is disabled (format -e) when testing ufs (as ufs does no flushing of the cache). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Difference between add and attach a device?
Rick Mann wrote: Hi. I've been reading the ZFS admin guide, and I don't understand the distinction between adding a device and attaching a device to a pool? attach is used to create or add a side to a mirror. add is to add a new top level vdev where that can be a raidz, mirror or single device. Writes are spread across top level vdevs. Hope that helps. Perhaps the zpool man page is clearer. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
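The add/attach distinction in Neil's answer, expressed as commands (device names are placeholders):

```shell
# 'attach' grows redundancy: turn an existing device into a mirror
# (or widen an existing mirror by one side).
zpool attach tank c0t0d0 c0t1d0   # c0t0d0 now mirrored by c0t1d0

# 'add' grows capacity: introduce a new top-level vdev; writes are
# then spread across all top-level vdevs.
zpool add tank mirror c1t0d0 c1t1d0
```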
Re: [zfs-discuss] ZIL on user specified devices?
Bryan, Your timing is excellent! We've been working on this for a while now and hopefully within the next day I'll be adding support for separate log devices into Nevada. I'll send out more details soon... Neil. Bryan Wagoner wrote: Quick question, Are there any tunables, or is there any way to specify devices in a pool to use for the ZIL specifically? I've been thinking through architectures to mitigate performance problems on SAN and various other storage technologies where disabling ZIL or cache flushes has been necessary to make up for performance and was wondering if there would be a way to specify a specific device or set of devices for the ZIL to use separate of the data devices so I wouldn't have to disable it in those circumstances. Thanks in advance! This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Drive Failure w/o Redundancy
Darren Dunham wrote: The problem I've come across with using mirror or raidz for this setup is that (as far as I know) you can't add disks to mirror/raidz groups, and if you just add the disk to the pool, you end up in the same situation as above (with more space but no redundancy). You can't add to an existing mirror, but you can add new mirrors (or raidz) items to the pool. If so, there's no loss of redundancy. Maybe I'm missing some context, but you can add to an existing mirror - see zpool attach. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Take Three: PSARC 2007/171 ZFS Separate Intent Log
Cyril, I wrote this case and implemented the project. My problem was that I didn't know what policy (if any) Sun has about publishing ARC cases, and a mail log with a gazillion email addresses. I did receive an answer to this in the form: http://www.opensolaris.org/os/community/arc/arc-faq/arc-publish-historical-checklist/ Never having done this it seems somewhat burdensome, and will take some time. Sorry, for the slow response and lack of feedback. Are there any particular questions you have about separate intent logs that I can answer before I embark on the process? Neil. Cyril Plisko wrote: Hello, This is a third request to open the materials of the PSARC case 2007/171 ZFS Separate Intent Log I am not sure why two previous requests were completely ignored (even when seconded by another community member). In any case that is absolutely unacceptable practice. On 6/30/07, Cyril Plisko [EMAIL PROTECTED] wrote: Hello ! I am adding zfs-discuss as it directly relevant to this community. On 6/23/07, Cyril Plisko [EMAIL PROTECTED] wrote: Hi, can the materials of the above be open for the community ? -- Regards, Cyril -- Regards, Cyril ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Take Three: PSARC 2007/171 ZFS Separate Intent Log
Cyril Plisko wrote: On 7/7/07, Neil Perrin [EMAIL PROTECTED] wrote: Cyril, I wrote this case and implemented the project. My problem was that I didn't know what policy (if any) Sun has about publishing ARC cases, and a mail log with a gazillion email addresses. I did receive an answer to this in the form: http://www.opensolaris.org/os/community/arc/arc-faq/arc-publish-historical-checklist/ Never having done this it seems somewhat burdensome, and will take some time. Neil, I am glad the message finally got through. It seems to me that the URL above refers to publishing materials of *historical* cases. Do you think the case in hand should be considered historical ? Yes, this was what I was asked to do. Looking more closely it doesn't look too bad. I'll start this process. Anyway, many ZFS related cases were openly reviewed from the moment zero of their life, why was this one an exception ? There's no good reason. Certainly the ideas had been kicked around on the alias, but I agree there was no specific proposal and call for discussion. Sorry, for the slow response and lack of feedback. Are there any particular questions you have about separate intent logs that I can answer before I embark on the process? Well, the only question I have now is what is it all about ? It is hard to ask questions without access to case materials, right ? So I've attached the accepted proposal. There was (as expected) not much discussion of this case as it was considered an obvious extension. The actual psarc case materials when opened will not have much more info than this. Hope this helps: Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Take Three: PSARC 2007/171 ZFS Separate Intent Log
Er, with attachment this time. So I've attached the accepted proposal. There was (as expected) not much discussion of this case as it was considered an obvious extension. The actual PSARC case materials, when opened, will not have much more info than this.

PSARC CASE: 2007/171 ZFS Separate Intent Log

SUMMARY: This is a proposal to allow separate devices to be used for the ZFS Intent Log (ZIL). The sole purpose of this is performance. The devices can be disks, solid state drives, nvram drives, or any device that presents a block interface.

PROBLEM: The ZIL satisfies the synchronous requirements of POSIX. For instance, databases often require their transactions to be on stable storage on return from the system call. NFS and other applications can also use fsync() to ensure data stability. The speed of the ZIL is therefore essential in determining the latency of writes for these critical applications. Currently the ZIL is allocated dynamically from the pool. It consists of a chain of varying block sizes which are anchored in fixed objects. Blocks are sized to fit the demand and will come from different metaslabs and thus different areas of the disk. This causes more head movement. Furthermore, the log blocks are freed as soon as the intent log transaction (system call) is committed. So a swiss-cheesing effect can occur, leading to pool fragmentation.

PROPOSED SOLUTION: This proposal takes advantage of the greatly faster media speeds of nvram, solid state disks, or even dedicated disks. To this end, additional extensions to the zpool command are defined:

zpool create <pool> <pool devices> log <log devices>

Creates a pool with a separate log. If more than one log device is specified then writes are load-balanced between devices. It's also possible to mirror log devices.
For example, a log consisting of two sets of two mirrors could be created thus:

zpool create <pool> <pool devices> \
    log mirror c1t8d0 c1t9d0 mirror c1t10d0 c1t11d0

A raidz/raidz2 log is not supported.

zpool add <pool> log <log devices>

Creates a separate log if it doesn't exist, or adds extra devices if it does.

zpool remove <pool> <log devices>

Removes the log devices. If all log devices are removed we revert to placing the log in the pool. Evacuating a log is easily handled by ensuring all txgs are committed.

zpool replace <pool> <old log device> <new log device>

Replaces the old log device with the new log device.

zpool attach <pool> <log device> <new log device>

Attaches a new log device to an existing log device. If the existing device is not a mirror then a 2-way mirror is created. If the device is part of a two-way log mirror, attaching new_device creates a three-way log mirror, and so on.

zpool detach <pool> <log device>

Detaches a log device from a mirror.

zpool status - additionally displays the log devices.
zpool iostat - additionally shows I/O statistics for log devices.
zpool export/import - will export and import the log devices.

When a separate log that is not mirrored fails, logging will start using chained logs within the main pool. The name "log" will become a reserved word. Attempts to create a pool with the name log will fail with:

cannot create 'log': name is reserved
pool name may have been omitted

Hot spares cannot replace log devices.
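The placement behavior the proposal describes - load-balancing log writes across separate log devices, and falling back to chained logs inside the main pool when no working slog remains - can be sketched as a toy model. This is an illustration only (the class and device names are made up; it is not ZFS source code):

```python
# Hypothetical toy model of where an intent-log block lands under the
# proposal above: round-robin across separate log devices when present,
# falling back to the main pool when they fail or are removed.

class ToyPool:
    def __init__(self, pool_devices, log_devices=None):
        self.pool_devices = list(pool_devices)
        self.log_devices = list(log_devices or [])
        self._next_log = 0  # round-robin index for load balancing

    def place_log_block(self):
        """Return the device that receives the next intent-log block."""
        if self.log_devices:
            dev = self.log_devices[self._next_log % len(self.log_devices)]
            self._next_log += 1
            return dev
        # No (working) separate log: chain log blocks inside the main pool.
        return self.pool_devices[0]

    def fail_log_devices(self):
        """Model an unmirrored slog failure: revert to the main pool."""
        self.log_devices = []

pool = ToyPool(["c1t0d0", "c1t1d0"], log_devices=["nvram0", "nvram1"])
print([pool.place_log_block() for _ in range(4)])  # alternates nvram0/nvram1
pool.fail_log_devices()
print(pool.place_log_block())  # back to the main pool
```

The round-robin here stands in for whatever load-balancing policy the real implementation uses; the point is only that log traffic is isolated on the slog devices until they disappear.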
[zfs-discuss] separate intent log blog
I wrote up a blog on the separate intent log, called "slog blog", which describes the interface, some performance results, and general status: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on Neil.
Re: [zfs-discuss] separate intent log blog
Albert Chin wrote: On Wed, Jul 18, 2007 at 01:29:51PM -0600, Neil Perrin wrote: I wrote up a blog on the separate intent log, called "slog blog", which describes the interface, some performance results, and general status: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on So, how did you get a pci Micro Memory pci1332,5425 card :) I presume this is the PCI-X version. I wasn't involved in the acquisition but was just sent one internally for testing. Yes, it's PCI-X. I assume you're asking because they cannot (or can no longer) be obtained? Neil.
Re: [zfs-discuss] separate intent log blog
Adolf, Yes, there was a separate driver that I believe came from Micro Memory. I installed it from a package, umem_Sol_Drv_Cust_i386_v01_10.pkg. I just used pkgadd on it and it just worked. Sorry, I don't know if it's publicly available or will even work for your device. I gave details of that device for completeness. I was hoping it would be representative of any NVRAM. I wasn't intending to endorse its use, although it does seem fast. Hardware availability and access to drivers is indeed an issue. 256MB is not a lot of NVRAM - the device I tested had 1GB. If you have a lot of synchronous transactions then you could exceed the 256MB and overflow into the slower main pool. Neil. Adolf Hohl wrote: Hi, what is necessary to get it working from the Solaris side? Is a driver on board or is there no special one needed? I just got a packed MM-5425CN with 256MB. However I am lacking a PCI-X 64-bit connector and am not sure if it is worth the whole effort for my personal purposes. Any comments are very appreciated. -ah
Re: [zfs-discuss] ZFS, ZIL, vq_max_pending and OSCON
Jay, Slides look good, though I'm not sure what you say along with "Filthy lying" on slide 22 related to the ZIL, or slide 27 which has "Worst Feature - thinks hardware is stupid". Anyway I have some comments on http://www.meangrape.com/2007/08/oscon-zfs You say: --- Records in the ZIL are discarded in a number of circumstances:

* a DMU transaction group completes and is committed to stable storage
* a write flagged O_DSYNC completes
* an fsync() call is completed
* a ZFS filesystem is successfully unmounted

Your first bullet is correct: in-memory and stable-storage intent log records are discarded when the DMU transaction group is committed to stable storage. However, this is the only time they are discarded. An O_DSYNC write or fsync will cause in-memory records to be written to the stable storage intent log. When unmounting, if there are any uncommitted transactions we wait for that DMU transaction group to commit. Most of this is explained in: http://blogs.sun.com/perrin/entry/the_lumberjack Hope that helps: Neil. Jay Edwards wrote: The slides from my ZFS presentation at OSCON (as well as some additional information) are available at http://www.meangrape.com/2007/08/oscon-zfs/ Jay Edwards [EMAIL PROTECTED] http://www.meangrape.com
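The corrected lifecycle - records accumulate in memory, fsync/O_DSYNC pushes them to the stable intent log, and only a DMU transaction group commit discards them - can be sketched as a toy model. This is a simplification for illustration, not ZFS internals:

```python
# Toy model of intent-log record lifecycle as described above:
# records live in memory, fsync() forces them to the stable log,
# and only a DMU txg commit discards them.

class ToyZIL:
    def __init__(self):
        self.in_memory = []   # itx records not yet on stable storage
        self.on_stable = []   # records written to the stable intent log

    def write(self, record):
        self.in_memory.append(record)  # async write: stays in memory

    def fsync(self):
        # Synchronous semantics: push in-memory records to the stable log.
        # Note: this writes records out; it does NOT discard them.
        self.on_stable.extend(self.in_memory)
        self.in_memory.clear()

    def txg_commit(self):
        # The pool itself is now consistent, so ALL log records
        # (in-memory and stable) can be discarded.
        self.in_memory.clear()
        self.on_stable.clear()

zil = ToyZIL()
zil.write("mkdir /a")
zil.fsync()                          # record now on the stable intent log
zil.write("write f1")
zil.txg_commit()                     # the only event that discards records
print(zil.in_memory, zil.on_stable)  # [] []
```

The key correction to the slide is visible in the model: fsync moves records between lists, while txg_commit is the only method that empties them.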
Re: [zfs-discuss] zfs iscsi storage for virtual machines
How does ZFS handle snapshots of large files like VM images? Is replication done on the bit/block level or by file? In other words, does a snapshot of a changed VM image take up the same amount of space as the image, or only the amount of space of the bits that have changed within the image? ZFS uses copy-on-write to implement snapshots. No replication is done. When changes are made, only the changed blocks are written anew (the original blocks are kept by the snapshot). Neil.
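The space accounting this implies can be shown with a toy block-level copy-on-write model (an illustration only, not ZFS code): a snapshot pins the blocks that existed at snapshot time, so the extra space it holds is only the blocks overwritten since, not the full file size.

```python
# Toy copy-on-write model: a snapshot references the old blocks, and a
# later write allocates a new block rather than overwriting in place.

class ToyCowFile:
    def __init__(self, nblocks):
        self.blocks = {i: f"v0_{i}" for i in range(nblocks)}
        self.snapshot = None

    def snap(self):
        self.snapshot = dict(self.blocks)  # block references, not copies

    def write_block(self, i, data):
        self.blocks[i] = data  # new block; snapshot still holds the old one

    def snapshot_unique_blocks(self):
        # Blocks held only by the snapshot (i.e. overwritten since).
        return sum(1 for i, b in self.snapshot.items() if self.blocks[i] != b)

vm_image = ToyCowFile(nblocks=1000)   # e.g. a large VM disk image
vm_image.snap()
for i in range(10):                   # guest changes 10 blocks
    vm_image.write_block(i, f"v1_{i}")
print(vm_image.snapshot_unique_blocks())  # 10, not 1000
```

So for a VM image, snapshot cost is proportional to guest write activity between snapshots, not to the image size.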
Re: [zfs-discuss] Serious ZFS problems
Tim Spriggs wrote: Hello, I think I have gained sufficient fool status for testing the fool-proof-ness of zfs. I have a cluster of T1000 servers running Solaris 10 and two x4100's running an OpenSolaris dist (Nexenta) which is at b68. Each T1000 hosts several zones, each of which has its own zpool associated with it. Each zpool is a mirrored configuration between an IBM N series NAS and another OSOL box serving iSCSI from zvols. To move zones around, I move the zone configuration and then move the zpool from one T1000 to another and bring the zone up. Now for the problem. For sake of brevity:

T1000-1: zpool export pool1
T1000-2: zpool export pool2
T1000-3: zpool import -f pool1
T1000-4: zpool import -f pool2

and other similar operations to move zone data around. Then I 'init 6'd all the T1000s. The reason for the init 6 was so that all of the pools would completely let go of the iSCSI LUNs so I can remove static configurations from each T1000. Upon reboot, pool1 has the following problem: WARNING: can't process intent log for pool1 During pool startup (spa_load()) zil_claim() is called on each dataset in the pool, and the first thing it tries to do is open the dataset (dmu_objset_open()). If this fails then the "can't process intent log..." message is printed. So you have a pretty serious pool consistency problem. I guess more information is needed. Running zdb on the pool would be useful, or zdb -l <device> to display the labels (on an exported pool). and then attempts to export the pool fail with: cannot open 'pool1': I/O error pool2 can consistently make a T1000 (Sol10) kernel panic when imported. It will also make an x4100 panic (osol). Any ideas? Thanks in advance. -Tim
Re: [zfs-discuss] Mixing SATA PATA Drives
Yes performance will suffer, but it's a bit difficult to say by how much. Both pool transaction group writes and zil writes are spread across all devices. It depends on what applications you will run as to how much use is made of the zil. Maybe you should experiment and see if performance is good enough. Neil. Tim Spriggs wrote: I'm far from an expert but my understanding is that the zil is spread across the whole pool by default so in theory the one drive could slow everything down. I don't know what it would mean in this respect to keep the PATA drive as a hot spare though. -Tim Christopher Gibbs wrote: Anyone? On 9/14/07, Christopher Gibbs [EMAIL PROTECTED] wrote: I suspect it's probably not a good idea but I was wondering if someone could clarify the details. I have 4 250G SATA(150) disks and 1 250G PATA(133) disk. Would it cause problems if I created a raidz1 pool across all 5 drives? I know the PATA drive is slower so would it slow the access across the whole pool or just when accessing that disk? Thanks for your input. - Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zfs log device (zil) ever coming to Sol10?
Separate log devices (slogs) didn't make it into S10U4 but will be in U5. Andy Lubel wrote: I think we are very close to using zfs in our production environment.. Now that I have snv_72 installed and my pools set up with NVRAM log devices things are hauling butt. I've been digging to find out whether this capability would be put into Solaris 10, does anyone know? If not, then I guess we can probably be OK using SXCE (as Joyent did). Thanks, Andy Lubel ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zfs log device (zil) ever coming to Sol10?
Matty wrote: On 9/18/07, Neil Perrin [EMAIL PROTECTED] wrote: Separate log devices (slogs) didn't make it into S10U4 but will be in U5. This is awesome! Will the SYNC_NV support that was integrated this week be added to update 5 as well? That would be super useful, assuming the major array vendors support it. I believe it will. So far we have just batched up all the bug fixes and enhancements in ZFS, and all of them are integrated into the next update. It's easier for us that way as well. Actually the part of "we" is not usually played by me! Neil.
Re: [zfs-discuss] enlarge a mirrored pool
Erik Trimble wrote: Ivan Wang wrote: Hi all, Forgive me if this is a dumb question. Is it possible for a two-disk mirrored zpool to be seamlessly enlarged by gradually replacing previous disk with larger one? Say, in a constrained desktop, only space for two internal disks is available, could I just begin with two 160G disks, then at some time, replace one of the 160G with 250G, resilvering, then replace another 160G, and finally get a two-disk 250G mirrored pool? Cheers, Ivan. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Yes. After both drives are replaced, you will automatically see the additional space. I believe currently after the last replace an import/export sequence is needed to force zfs to see the increased size. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] characterizing I/O on a per zvol basis.
I don't know of any way to observe IOPS per zvol and I believe this would be tricky. Any writes/reads from individual datasets (filesystems and zvols) will go through the pipeline and can fan out to multiple mirrors or raidz or be striped across devices. Block writes will be combined and pushed out in transaction groups, but if synchronous will also have separate (and possibly multiple) intent log writes. Reads if not cached can similarly come from multiple locations. The individual IOs are not tagged with the dataset(s) they are servicing. It would be easier to observe the byte count and read/write request count for a zvol using dtrace. Neil. Nathan Kroenert wrote: Hey all - Time for my silly question of the day, and before I bust out vi and dtrace... If there a simple, existing way I can observe the read / write / IOPS on a per-zvol basis? If not, is there interest in having one? Cheers! Nathan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL reliability/replication questions
Scott Laird wrote: I'm debating using an external intent log on a new box that I'm about to start working on, and I have a few questions. 1. If I use an external log initially and decide that it was a mistake, is there a way to move back to the internal log without rebuilding the entire pool? It's not currently possible to remove a separate log. This was working once, but was stripped out until the more generic zpool remove devices was provided. This is bug 6574286: http://bugs.opensolaris.org/view_bug.do?bug_id=6574286 2. What happens if the logging device fails completely? Does this damage anything else in the pool, other then potentially losing in-flight transactions? This should work. It shouldn't even lose the in-flight transactions. ZFS reverts to using the main pool if a slog write fails or the slog fills up. 3. What about corruption in the log? Is it checksummed like the rest of ZFS? Yes it's checksummed, but the checksumming is a bit different from the pool blocks in the uberblock tree. See also: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on Thanks. Scott ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL reliability/replication questions
Scott Laird wrote: On 10/18/07, Neil Perrin [EMAIL PROTECTED] wrote: Scott Laird wrote: I'm debating using an external intent log on a new box that I'm about to start working on, and I have a few questions. 1. If I use an external log initially and decide that it was a mistake, is there a way to move back to the internal log without rebuilding the entire pool? It's not currently possible to remove a separate log. This was working once, but was stripped out until the more generic zpool remove <devices> was provided. This is bug 6574286: http://bugs.opensolaris.org/view_bug.do?bug_id=6574286 Okay, so hopefully it'll work in a couple of quarters? It's not being worked on currently but hopefully will be fixed in 6 months. 2. What happens if the logging device fails completely? Does this damage anything else in the pool, other than potentially losing in-flight transactions? This should work. It shouldn't even lose the in-flight transactions. ZFS reverts to using the main pool if a slog write fails or the slog fills up. So, the only way to lose transactions would be a crash or power loss, leaving outstanding transactions in the log, followed by the log device failing to start up on reboot? I assume that would be handled relatively cleanly (files have out-of-date data), as opposed to something nasty like the pool failing to start up. I just checked on the behaviour of this. The log is treated as part of the main pool. If it is not replicated and disappears then the pool can't be opened - just like any unreplicated device in the main pool. If the slog is found but can't be opened or is corrupted then the pool will be opened but the slog isn't used. This seems a bit inconsistent. 3. What about corruption in the log? Is it checksummed like the rest of ZFS? Yes, it's checksummed, but the checksumming is a bit different from the pool blocks in the uberblock tree. See also: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on That started this whole mess :-).
I'd like to try out using one of the Gigabyte SATA ramdisk cards that are discussed in the comments. A while ago there was a comment on this alias that these cards weren't purchasable. Unfortunately, I don't know what is available. It supposedly has 18 hours of battery life, so a long-term power outage would kill the log. I could reasonably expect one 18+ hour power outage over the life of the filesystem. I'm fine with losing in-flight data (I'd expect the log to be replayed before the UPS shuts the system down anyway), but I'd rather not lose the whole pool or something extreme like that. I'm willing to trade the chance of some transaction losses during an exceptional event for more performance, but I'd rather not have to pull out the backups if I can ever avoid it. Scott ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL reliability/replication questions
Scott Laird wrote: On 10/18/07, Neil Perrin [EMAIL PROTECTED] wrote: So, the only way to lose transactions would be a crash or power loss, leaving outstanding transactions in the log, followed by the log device failing to start up on reboot? I assume that would be handled relatively cleanly (files have out-of-date data), as opposed to something nasty like the pool failing to start up. I just checked on the behaviour of this. The log is treated as part of the main pool. If it is not replicated and disappears then the pool can't be opened - just like any unreplicated device in the main pool. If the slog is found but can't be opened or is corrupted then the pool will be opened but the slog isn't used. This seems a bit inconsistent. Hmm, yeah. What would happen if I mirrored the ramdisk with a hard drive? Would ZFS block until the data's stable on both devices, or would it continue once the write is complete on the ramdisk? ZFS ensures all mirror sides have the data before returning. Failing that, would replacing the missing log with a blank device let me bring the pool back up, or would it be dead at that point? Replacing the device would work:

: mull ; mkfile 100m /p1 /p2
: mull ; zpool create whirl /p1 log /p2
: mull ; echo abc > /whirl/f
: mull ; sync
: mull ; rm /p2
: mull ; sync
<reset system>
: mull ; zpool status
  pool: whirl
 state: UNAVAIL
status: One or more devices could not be opened. There are insufficient replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       UNAVAIL      0     0     0  insufficient replicas
          /p1       ONLINE       0     0     0
        logs        UNAVAIL      0     0     0  insufficient replicas
          /p2       UNAVAIL      0     0     0  cannot open

: mull ; mkfile 100m /p2 /p3
: mull ; zpool online whirl /p2
warning: device '/p2' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
: mull ; zpool status
  pool: whirl
 state: ONLINE
status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       ONLINE       0     0     0
          /p1       ONLINE       0     0     0
        logs        ONLINE       0     0     0
          /p2       UNAVAIL      0     0     0  corrupted data

errors: No known data errors
: mull ; zpool replace whirl /p2 /p3
: mull ; zpool status
  pool: whirl
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Oct 18 18:16:39 2007
config:

        NAME           STATE     READ WRITE CKSUM
        whirl          ONLINE       0     0     0
          /p1          ONLINE       0     0     0
        logs           ONLINE       0     0     0
          replacing    ONLINE       0     0     0
            /p2        UNAVAIL      0     0     0  corrupted data
            /p3        ONLINE       0     0     0

errors: No known data errors
: mull ; zpool status
  pool: whirl
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Oct 18 18:16:39 2007
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       ONLINE       0     0     0
          /p1       ONLINE       0     0     0
        logs        ONLINE       0     0     0
          /p3       ONLINE       0     0     0

errors: No known data errors
: mull ; zfs mount
: mull ; zfs mount -a
: mull ; cat /whirl/f
abc
: mull ;

3. What about corruption in the log? Is it checksummed like the rest of ZFS? Yes, it's checksummed, but the checksumming is a bit different from the pool blocks in the uberblock tree. See also: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on That started this whole mess :-). I'd like to try out using one of the Gigabyte SATA ramdisk cards that are discussed in the comments.
A while ago there was a comment on this alias that these cards weren't purchasable. Unfortunately, I don't know what is available. The umem one is unavailable, but the Gigabyte model is easy to find. I had Amazon overnight one to me; it's probably sitting at home right now. Cool, let us know how it goes. Neil.
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
Joe, I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices, but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you, 15MB/s seems a bit on the slow side - especially if cache flushing is disabled. It would be interesting to see what all the threads are waiting on. I think the problem may be that everything is backed up waiting to start a transaction, because the txg train is slow due to NFS requiring the ZIL to push everything synchronously. Neil. Joe Little wrote: I have historically noticed that in ZFS, whenever there is a heavy writer to a pool via NFS, the reads can be held back (basically paused). An example is a RAID10 pool of 6 disks, whereby a directory of files, including some large 100+MB in size, being written can cause other clients over NFS to pause for seconds (5-30 or so). This is on B70 bits. I've gotten used to this behavior over NFS, but didn't see it perform as such when on the server itself doing similar actions. To improve upon the situation, I thought perhaps I could dedicate a log device outside the pool, in the hope that while heavy writes went to the log device, reads would merrily be allowed to coexist from the pool itself. My test case isn't ideal per se, but I added a local 9GB SCSI (80) drive for a log, and added two LUNs for the pool itself. You'll see from the below that while the log device is pegged at 15MB/sec (sd5), my directory list request on devices sd15 and sd16 is never answered. I tried this with both no-cache-flush enabled and off, with negligible difference. Is there any way to force a better balance of reads/writes during heavy writes?
                    extended device statistics
device    r/s    w/s   kr/s     kw/s  wait  actv  svc_t  %w  %b
fd0       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd0       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd1       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd2       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd3       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd4       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd5       0.0  118.0    0.0  15099.9   0.0  35.0  296.7   0 100
sd6       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd7       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd8       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd9       0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd10      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd11      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd12      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd13      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd14      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd15      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
sd16      0.0    0.0    0.0      0.0   0.0   0.0    0.0   0   0
...
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
Roch - PAE wrote: Neil Perrin writes: Joe Little wrote: On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote: Joe, I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices, but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you, 15MB/s seems a bit on the slow side - especially if cache flushing is disabled. It would be interesting to see what all the threads are waiting on. I think the problem may be that everything is backed up waiting to start a transaction because the txg train is slow due to NFS requiring the ZIL to push everything synchronously. I agree completely. The log (even though slow) was an attempt to isolate writes away from the pool. I guess the question is how to provide for async access for NFS. We may have 16, 32 or whatever threads, but if a single writer keeps the ZIL pegged, prohibiting reads, it's all for nought. Is there any way to tune/configure the ZFS/NFS combination to balance reads/writes so as not to starve one for the other? It's either feast or famine, or so tests have shown. No, there's currently no way to give reads preference over writes. All transactions get equal priority to enter a transaction group. Three txgs can be outstanding as we use a 3-phase commit model: open, quiescing, and syncing. That makes me wonder if this is not just the lack-of-write-throttling issue. If one txg is syncing and the other is quiesced out, I think it means we have let in too many writes. We do need a better balance. Neil, is it correct that reads never hit txg_wait_open(), but they just need an I/O scheduler slot? Yes, they don't modify any metadata (except access time, which is handled separately). I'm less clear about what happens further down in the DMU and SPA.
Re: [zfs-discuss] ZFS write frequency
Ajay Kumar wrote: IHAC who would like to understand following: We've upgraded a box to sol10-u4 and created a ZFS pool. We notice that running zfs iostat 1 or iostat -xnz 1, the data gets written to disk every 5 seconds, even though the data is being copied to the filesystem continuously. This behavior is different than UFS as UFS continuously writes. So, what's with the 5 second pause? ZFS creates transactions for systems calls that modify the pool. For efficiency it gathers together individual transactions into transaction groups (txgs) which are committed every 5 seconds. If you are seeing some constant background write activity then that is probably due to synchronous writes which require data be stable on return from the system call. These are written on demand to an intent log. Any clarification will be appreciated. Thank you Ajay ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
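The bursty write pattern described above - continuous copying, but disk writes only every 5 seconds - follows directly from batching transactions into txgs. A toy illustration (names and the fixed 5-second interval are a simplification, not ZFS code):

```python
# Toy model of txg batching: system calls accumulate in an open
# transaction group that is committed to disk periodically, while
# synchronous (O_DSYNC/fsync) writes hit the intent log immediately.

TXG_INTERVAL = 5.0  # seconds between txg commits, as described above

class ToyTxgBatcher:
    def __init__(self):
        self.open_txg = []      # transactions accumulating in memory
        self.disk_writes = []   # (time, kind) pairs actually hitting disk

    def syscall_write(self, now, data, sync=False):
        self.open_txg.append(data)
        if sync:
            # Synchronous semantics: intent-log write happens right away.
            self.disk_writes.append((now, "zil"))

    def maybe_commit(self, now):
        if self.open_txg and now % TXG_INTERVAL == 0:
            self.disk_writes.append((now, "txg"))
            self.open_txg = []

b = ToyTxgBatcher()
for t in range(1, 11):            # continuous copying for 10 seconds
    b.syscall_write(t, f"block{t}")
    b.maybe_commit(t)
print(b.disk_writes)              # pool writes only at t=5 and t=10
```

Setting `sync=True` on a write models the "constant background write activity" mentioned above: the intent-log write appears immediately instead of waiting for the next txg commit.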
Re: [zfs-discuss] Bugid 6535160
Vincent Fox wrote: So does anyone have any insight on BugID 6535160? We have verified on a similar system, that ZFS shows big latency in filebench varmail test. We formatted the same LUN with UFS and latency went down from 300 ms to 1-2 ms. This is such a big difference it makes me think something else is going on. I suspect one of two possible causes: A) The disk write cache is enabled and volatile. UFS knows nothing of write caches and requires the write cache to be disabled otherwise corruption can occur. B) The write cache is non volatile, but ZFS hasn't been configured to stop flushing it (set zfs:zfs_nocacheflush = 1). Note, ZFS enables the write cache and will flush it as necessary. http://sunsolve.sun.com/search/document.do?assetkey=1-1-6535160-1 We run Solaris 10u4 on our production systems, don't see any indication of a patch for this. I'll try downloading recent Nevada build and load it on same system and see if the problem has indeed vanished post snv_71. Yes please try this. I think it will make a difference but the delta will be small. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] copy on write related query
sudarshan sridhar wrote: I'm not quite sure what you're asking here. Data, whether newly written or copy-on-write, goes to a newly allocated block, which may reside on any vdev, and will be spread across devices if using RAID. My exact doubt is: if COW is the default behavior of ZFS, then is COWed data written to the same physical drive where the filesystem resides? Yes. If so, the physical device capacity should be more than what the filesystem size is. Yes. I mean, in a normal filesystem scenario, a partition of 1GB with some filesystem (say ext2fs) is created; then users can save up to 1GB of data under that. This is not true of any filesystem. There is always some overhead for metadata like indirect blocks, journals, superblocks, space maps etc. Some filesystems (e.g. UFS) have fixed areas for metadata, which limits the number of files and possible data, whereas others dynamically allocate the metadata (e.g. ZFS). The former is more predictable and the latter more flexible. Is the same behavior with ZFS? Because I feel since COW is the default, ZFS requires 1GB for one filesystem in order to store COWed data. Please correct me if I am wrong. -sridhar
Re: [zfs-discuss] Intent logs vs Journaling
parvez shaikh wrote: Hello, I am learning ZFS, its design and layout. I would like to understand how intent logs are different from a journal? Journals too are logs of updates to ensure consistency of a file system over crashes. The purpose of the intent log also appears to be the same. I hope I am not missing something important in these concepts. There is a difference. A journal contains the necessary transactions to make the on-disk fs consistent. The ZFS intent log is not needed for consistency. Here's an extract from http://blogs.sun.com/perrin/entry/the_lumberjack : ZFS is always consistent on disk due to its transaction model. Unix system calls can be considered as transactions which are aggregated into a transaction group for performance and committed together periodically. Either everything commits or nothing does. That is, if power goes out, then the transactions in the pool are never partial. This commit happens fairly infrequently - typically a few seconds between each transaction group commit. Some applications, such as databases, need assurance that, say, the data they wrote or the mkdir they just executed is on stable storage, and so they request synchronous semantics such as O_DSYNC (when opening a file), or execute fsync(fd) after a series of changes to a file descriptor. Obviously waiting seconds for the transaction group to commit before returning from the system call is not a high-performance solution. Thus the ZFS Intent Log (ZIL) was born. Also, I read that updates in ZFS are intrinsically atomic; I can't understand how they are intrinsically atomic: http://weblog.infoworld.com/yager/archives/2007/10/suns_zfs_is_clo.html I would be grateful if someone could address my query. Thanks
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
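The two application-side requests mentioned in the extract above (O_DSYNC at open time, or fsync() after a batch of writes) can be sketched as follows. This is an illustrative sketch with made-up function names; note that `os.O_DSYNC` availability is platform-dependent:

```python
import os

def dsync_write(path, record):
    """Per-write synchronous semantics: with O_DSYNC, every write()
    returns only once the data is on stable storage (serviced by the
    ZIL on ZFS, rather than waiting for a transaction group commit)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    try:
        os.write(fd, record)
    finally:
        os.close(fd)

def batched_then_fsync(path, records):
    """Ordinary buffered writes followed by one fsync(): the changes
    are forced to stable storage together in a single call."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for rec in records:
            os.write(fd, rec)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Both styles get the same durability guarantee; the fsync() style amortizes the cost over several writes, while O_DSYNC pays it on every write().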
Re: [zfs-discuss] ZFS behavior with fsync() calls
Todd Moore wrote: My understanding is that the answers to the questions posed below are both YES, due to the transactional design of ZFS. However, I'm working with some folks that need more details or documents describing the design/behavior without having to look through all the source code. [b]Scenario 1[/b] * Create file * Open and Write data to file * Issue fsync() call for file [b]Question:[/b] Is it guaranteed that the write to the directory occurs prior to the write to the file? Yes, this is guaranteed. [b]Scenario 2[/b] * Write an extended attribute (such as a file version number) for a file. * Open and Write data to file * Issue fsync() call for file [b]Question:[/b] Is it guaranteed that the extended attribute write occurs prior to the write to the file? Again, yes, this is guaranteed in ZFS. ZFS writes all transactions related to the specified file, plus any other transactions it depends on (such as those needed to create the file). Additionally, is it possible that there are differences in this behavior in these scenarios between Solaris 10 U4 and a SXDE 01/08 implementation (snv_b79)? No, the ZFS code has always been this way. The ZIL, which handles this behaviour, is described at http://blogs.sun.com/perrin/entry/the_lumberjack but this may be insufficient detail for you. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
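Scenario 1 above, written out as a minimal sketch (the function name is mine): the single fsync() at the end is what forces both the file data and the directory update it depends on (the create) to stable storage:

```python
import os

def create_write_fsync(path, data):
    """Scenario 1: create a file, write data, fsync.  Per the answer
    above, ZFS guarantees the directory update (the create) reaches
    stable storage no later than the file data."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # commits the data *and* the create it depends on
    finally:
        os.close(fd)
```
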
Re: [zfs-discuss] NFS performance on ZFS vs UFS
Steve Hillman wrote: I realize that this topic has been fairly well beaten to death on this forum, but I've also read numerous comments from ZFS developers that they'd like to hear about significantly different performance numbers of ZFS vs UFS for NFS-exported filesystems, so here's one more. The server is an x4500 with 44 drives configured in a RAID10 zpool, and two drives mirrored and formatted with UFS for the boot device. It's running Solaris 10u4, patched with the Recommended Patch Set from late Dec/07. The client (if it matters) is an older V20z w/ Solaris 10 3/05. No tuning has been done on either box. The test involved copying lots of small files (2-10k) from an NFS client to a mounted NFS volume. A simple 'cp' was done, both with 1 thread and 4 parallel threads (to different directories), and then I monitored to see how fast the files were accumulating on the server. ZFS: 1 thread - 25 files/second; 4 threads - 25 files/second (~6 per thread) UFS: (same server, just exported /var from the boot volume) 1 thread - 200 files/second; 4 threads - 520 files/second (~130/thread) With this big a difference, I suspect the write cache is enabled on the disks. UFS requires this cache to be disabled or battery-backed, otherwise corruption can occur. For comparison, the same test was done to a NetApp FAS270 that the x4500 was bought to replace: 1 thread - 70 files/second; 4 threads - ~250 files/second I don't know enough about that system, but perhaps it has NVRAM or an SSD to service the synchronous demands of NFS. An equivalent setup could be configured with a separate intent log on a similar fast device. I have been able to work around this performance hole by exporting multiple ZFS filesystems, because the workload is spread across a hashed directory structure. I then get 25 files per FS per second. Still, I thought I'd raise it here anyway. If there's something I'm doing wrong, I'd love to hear about it.
I'm also assuming that this ties into BugID 6535160 "Lock contention on zl_lock from zil_commit", so if that's the case, please add another vote for making this fix available as a patch for S10u4 users. I believe this is a different problem than 6535160. Thanks, Steve Hillman This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
Roch - PAE wrote: Jonathan Loran writes: Is it true that Solaris 10 u4 does not have any of the nice ZIL controls that exist in the various recent OpenSolaris flavors? I would like to move my ZIL to solid-state storage, but I fear I can't do it until I have another update. Heck, I would be happy to just be able to turn the ZIL off to see how my NFS-on-ZFS performance is affected before spending the $'s. Anyone know when we will see this in Solaris 10? You can certainly turn it off with any release (Jim's link). It's true that S10u4 does not have the separate intent log to allow using an SSD for ZIL blocks. I believe S10U5 will have that feature. Unfortunately it will not. A lot of ZFS fixes and features that have existed for a while will not be in U5 (for reasons I can't go into here). They should be in S10U6... Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
Jonathan Loran wrote: Vincent Fox wrote: Are you already running with zfs_nocacheflush=1? We have SAN arrays with dual battery-backed controllers for the cache, so we definitely have this set on all our production systems. It makes a big difference for us. No, we're not using zfs_nocacheflush=1, but our SAN arrays are set to cache all writebacks, so it shouldn't be needed. I may test this, if I get the chance to reboot one of the servers, but I'll bet the storage arrays are working correctly. I think there's some confusion. ZFS and the ZIL issue controller commands to force the disk cache to be flushed, to ensure data is on stable storage. If the disk cache is battery-backed then the costly flush is unnecessary. As Vincent said, setting zfs_nocacheflush=1 can make a huge difference. Note that this is a system-wide variable, so the caches on all controllers serving ZFS devices should be non-volatile before enabling it. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Performance Issue
Marc Bevand wrote: William Fretts-Saxton william.fretts.saxton at sun.com writes: I disabled file prefetch and there was no effect. Here are some performance numbers. Note that, when the application server used a ZFS file system to save its data, the transaction took TWICE as long. For some reason, though, iostat is showing 5x as much disk writing (to the physical disks) on the ZFS partition. Can anyone see a problem here? Possible explanation: the Glassfish applications are using synchronous writes, causing the ZIL (ZFS Intent Log) to be intensively used, which leads to a lot of extra I/O. The ZIL doesn't do a lot of extra I/O. It usually just does one write per synchronous request, and will batch up multiple writes into the same log block if possible. However, it does need to wait for the writes to be on stable storage before returning to the application, which is what the application has requested. It does this by waiting for the write to complete and then flushing the disk write cache. If the write cache is battery-backed for all zpool devices then the global zfs_nocacheflush can be set to give dramatically better performance. Try to disable it: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29 Since disabling it is not recommended, if you find out it is the cause of your perf problems, you should instead try to use a slog (separate intent log, see above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support slogs; they have only been added in OpenSolaris build snv_68: http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on -marc ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
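The batching Neil describes (multiple synchronous requests sharing a log block) matters most under exactly this kind of workload: several writers issuing small fsync()ed writes concurrently. A hedged sketch of that access pattern, with illustrative names (the coalescing itself happens inside ZFS, not in this code):

```python
import os
import threading

def writer(path, payload, count):
    """One writer doing small synchronous appends, fsync() per write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # concurrent fsyncs can be coalesced by the ZIL
    finally:
        os.close(fd)

def run_concurrent(dirpath, nthreads=4, count=10):
    """Run several synchronous writers at once, one file each."""
    threads = [
        threading.Thread(
            target=writer,
            args=(os.path.join(dirpath, "f%d" % i), b"x" * 512, count),
        )
        for i in range(nthreads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With a volatile write cache, every one of those fsyncs also pays for a cache flush, which is where the 5x iostat write traffic and doubled transaction time above can come from.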
Re: [zfs-discuss] 100% random writes coming out as 50/50 reads/writes
Nathan Kroenert wrote: And something I was told only recently - It makes a difference if you created the file *before* you set the recordsize property. If you created them after, then no worries, but if I understand correctly, if the *file* was created with 128K recordsize, then it'll keep that forever... Assuming I understand correctly. Hopefully someone else on the list will be able to confirm. Yes, that is correct. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and multipath with iSCSI
ZFS will handle out-of-order writes due to its transactional nature. Individual writes can be safely reordered. When a transaction group commits, ZFS waits for all its writes and flushes them; it then writes a new uberblock with the new transaction group number and flushes that. Chris Siebenmann wrote: We're currently designing a ZFS fileserver environment with iSCSI-based storage (for failover, cost, ease of expansion, and so on). As part of this we would like to use multipathing for extra reliability, and I am not sure how we want to configure it. Our iSCSI backend only supports multiple sessions per target, not multiple connections per session (and my understanding is that the Solaris initiator doesn't currently support multiple connections anyways). However, we have been cautioned that there is nothing in the backend that imposes a global ordering for commands between the sessions, and so disk IO might get reordered if Solaris's multipath load balancing submits part of it to one session and part to another. So: does anyone know if Solaris's multipath and iSCSI systems already take care of this, or if ZFS is already paranoid enough to deal with this, or if we should configure Solaris multipathing to not load-balance? (A load-balanced multipath configuration is simpler for us to administer, at least until I figure out how to tell Solaris multipathing which is the preferred network for any given iSCSI target, so we can balance the overall network load by hand.) Thanks in advance. - cks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] incorrect/conflicting suggestion in error message on a faulted pool
Haudy, Thanks for reporting this bug and helping to improve ZFS. I'm not sure either how you could have added a note to an existing report. Anyway, I've gone ahead and done that for you in the Related Bugs field, though opensolaris.org doesn't reflect it yet. Neil. Haudy Kazemi wrote: I have reported this bug here: http://bugs.opensolaris.org/view_bug.do?bug_id=6685676 I think this bug may be related, but I do not see where to add a note to an existing bug report: http://bugs.opensolaris.org/view_bug.do?bug_id=6633592 (Both bugs refer to ZFS-8000-2Q; however, my report shows a FAULTED pool instead of a DEGRADED pool.) Thanks, -hk Haudy Kazemi wrote: Hello, I'm writing to report what I think is an incorrect or conflicting suggestion in the error message displayed on a faulted pool that does not have redundancy (equivalent to RAID0?). I ran across this while testing and learning about ZFS on a clean installation of NexentaCore 1.0. Here is how to recreate the scenario:

[EMAIL PROTECTED]:~$ mkfile 200m testdisk1 testdisk2
[EMAIL PROTECTED]:~$ sudo zpool create mybigpool $PWD/testdisk1 $PWD/testdisk2
Password:
[EMAIL PROTECTED]:~$ zpool status mybigpool
  pool: mybigpool
 state: ONLINE
 scrub: none requested
config:
        NAME                          STATE     READ WRITE CKSUM
        mybigpool                     ONLINE       0     0     0
          /export/home/kaz/testdisk1  ONLINE       0     0     0
          /export/home/kaz/testdisk2  ONLINE       0     0     0
errors: No known data errors
[EMAIL PROTECTED]:~$ sudo zpool scrub mybigpool
[EMAIL PROTECTED]:~$ zpool status mybigpool
  pool: mybigpool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Mon Apr 7 22:09:29 2008
config:
        NAME                          STATE     READ WRITE CKSUM
        mybigpool                     ONLINE       0     0     0
          /export/home/kaz/testdisk1  ONLINE       0     0     0
          /export/home/kaz/testdisk2  ONLINE       0     0     0
errors: No known data errors

Up to here everything looks fine. Now let's destroy one of the virtual drives:

[EMAIL PROTECTED]:~$ rm testdisk2
[EMAIL PROTECTED]:~$ zpool status mybigpool
  pool: mybigpool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Mon Apr 7 22:09:29 2008
config:
        NAME                          STATE     READ WRITE CKSUM
        mybigpool                     ONLINE       0     0     0
          /export/home/kaz/testdisk1  ONLINE       0     0     0
          /export/home/kaz/testdisk2  ONLINE       0     0     0
errors: No known data errors

Okay, it still looks fine, but I haven't tried to read/write to it yet. Try a scrub:

[EMAIL PROTECTED]:~$ sudo zpool scrub mybigpool
[EMAIL PROTECTED]:~$ zpool status mybigpool
  pool: mybigpool
 state: FAULTED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: scrub completed after 0h0m with 0 errors on Mon Apr 7 22:10:36 2008
config:
        NAME                          STATE     READ WRITE CKSUM
        mybigpool                     FAULTED      0     0     0  insufficient replicas
          /export/home/kaz/testdisk1  ONLINE       0     0     0
          /export/home/kaz/testdisk2  UNAVAIL      0     0     0  cannot open
errors: No known data errors
[EMAIL PROTECTED]:~$

There we go. The pool has faulted, as I expected to happen because I created it as a non-redundant pool. I think it was the equivalent of a RAID0 pool with checksumming; at least it behaves like one. The key to my reporting this is that the status message says "One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state." while the message further down, to the right of the pool name, says "insufficient replicas". The verbose status message is wrong in this case. From other forum/list posts it looks like that status message is also used for degraded pools, which isn't a problem, but here we have a faulted pool. Here's an example of the same status message used appropriately: http://mail.opensolaris.org/pipermail/zfs-discuss/2006-April/031298.html Is anyone else able to reproduce this? And if so, is there a ZFS bug tracker to report this to? (I didn't see a public bug tracker when I looked.) Thanks, Haudy Kazemi ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pause Solaris with ZFS compression busy by doing a cp?
I also noticed (perhaps by design) that a copy with compression off almost instantly returns, but the writes continue LONG after the cp process claims to be done. Is this normal? Yes, this is normal. Unless the application is doing synchronous writes (e.g. a database), the file will be written to disk at the convenience of the FS. Most filesystems operate this way. It's too expensive to synchronously write out data, so it's batched up and written asynchronously. Wouldn't closing the file ensure it was written to disk? No. Is that tunable somewhere? No. For ZFS you can use sync(1M), which will force out all transactions for all files in the pool. That is expensive, though. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
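The behaviour described above (close() does not force data to disk) comes down to the difference between these two call sequences; a minimal sketch with illustrative names:

```python
import os

def write_async(path, data):
    """Ordinary buffered write: after close() the data may still live
    only in memory, to be written out later at the FS's convenience
    (the cp-returns-early case above)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, data)
    os.close(fd)  # returns immediately; no on-disk guarantee

def write_and_force(path, data):
    """Same write, but fsync() before close forces this file's data to
    stable storage.  (sync(1M), or os.sync() in Python, would instead
    push *all* pending data, which is far more expensive.)"""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, data)
    os.fsync(fd)
    os.close(fd)
```

An application that needs per-file durability calls fsync() itself; there is no mount-level knob in ZFS to make close() imply it.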
Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
Hugh Saunders wrote: On Sat, May 24, 2008 at 4:00 PM, [EMAIL PROTECTED] wrote: does cache improve write performance or only reads? The L2ARC cache device is for reads... for writes you want an intent log device. Thanks for answering my question; I had seen mention of intent log devices, but wasn't sure of their purpose. If only one significantly faster disk is available, would it make sense to slice it and use a slice for L2ARC and a slice for ZIL? Or would that cause horrible thrashing? I wouldn't recommend this configuration. As you say, it would thrash the head. Log devices mainly need to write fast; they are only ever read on reboot, if there are uncommitted transactions. Cache devices, on the other hand, require fast reads, as their writes can be done slowly and asynchronously. So a common device sliced for both purposes wouldn't work well unless it was fast at both reads and writes and had minimal seek times (NVRAM, solid-state disk). Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] slog devices don't resilver correctly
Joe Little wrote: On Tue, May 27, 2008 at 4:50 PM, Eric Schrock [EMAIL PROTECTED] wrote: Joe - We definitely don't do great accounting of the 'vdev_islog' state here, and it's possible to create a situation where the parent replacing vdev has the state set but the children do not, but I have been unable to reproduce the behavior you saw. I have rebooted the system during resilver, manually detached the replacing vdev, and tried a variety of other things, but I've never seen the behavior you describe. In all cases, the log state is kept with the replacing vdev and restored when the resilver completes. I have also not observed the resilver failing with a bad log device. Can you provide more information about how to reproduce this problem? Perhaps without rebooting into B70 in the middle? Well, this happened live on a production system, and I'm still in the process of rebuilding said system (trying to save all the snapshots). I don't know what triggered it. It was trying to resilver in B85; rebooted into B70, where it did resilver (but it was now using cmdk device naming vs the full SCSI device names). It was still marked degraded even though resilvering finished. Since the resilver took so long, I suspect the splicing-in of the device took place in B70. Again, it would never work in B85 -- it just kept resetting. I'm wondering if the device path changing from cxtxdx to cxdx could be the trigger point. Joe, We're sorry about your problems. My take on how this is best handled is that it would be better to expedite (raise the priority of) fixing bug 6574286 "removing a slog doesn't work" rather than expend too much effort in understanding how it failed on your system. You would not have had this problem if you had been able to remove a log device. Is that reasonable? Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE 4852783
This is actually quite a tricky fix, as obviously data and metadata have to be relocated. Although there's been no visible activity on this bug, there has been substantial design activity to allow the RFE to be fixed more easily. Anyway, to answer your question, I would fully expect this RFE to be fixed within a year, but can't guarantee it. Neil. Miles Nordin wrote: Is RFE 4852783 (need for an equivalent to LVM2's pvmove) likely to happen within the next year? My use-case is home user. I have 16 disks spinning, two towers of eight disks each, exporting some of them as iSCSI targets. Four disks are 1TB disks already in ZFS mirrors, and 12 disks are 180 - 320GB and contain 12 individual filesystems. If RFE 4852783 will happen in a year, I can move the smaller disks and their data into the ZFS mirror. As they die I will replace them with pairs of ~1TB disks. I worry the RFE won't happen because it looks 5 years old with no posted ETA. If it won't be closed within a year, some of those 12 disks will start failing and need replacement. We find we lose one or two each year. If I added them to ZFS, I'd have to either waste money, space, and power on buying undersized replacement disks, or else do silly and dangerously confusing things with slices. Therefore in that case I will leave the smaller disks out of ZFS and add only 1TB devices to these immutable vdevs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and Caching - write() syscall with O_SYNC
Patrick Pinchera wrote: IHAC using ZFS in production, and he's opening up some files with the O_SYNC flag. This affects subsequent write()'s by providing synchronized I/O file integrity completion. That is, each write(2) will wait for both the file data and file status to be physically updated. Because of this, he's seeing some delays on the file write()'s. This is verified with DTrace. He's got a storage array with a read/write cache already. What does ZFS introduce to this O_SYNC flag? Is ZFS doing some caching itself, too? Yes, but not in the path of the synchronous request. The latency isn't affected by other ZFS caching. Are there settings we got by default when we created the ZFS pools that already give us the equivalent of O_SYNC? No. Is there something we should consider turning on or off with regard to ZFS? Yes: because your write cache is non-volatile, you can disable the ZFS write-cache flush. See: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes Note this should only really be done if ZFS is the only user of the storage array. My feeling is that in an effort to make these write()'s go completely to the disk, we may have gone overboard with one or more of the following: * setting O_SYNC on the file open() to affect the write()'s * using ZFS * using a storage array with a battery-backed read/write cache Can we eliminate one or more of these and still get the file integrity we want? PRD;IANOTA Regards, Pat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
Mertol, Yes, dedup is certainly on our list and has been actively discussed recently, so there's hope and some forward progress. It would be interesting to see where it fits into our customers' priorities for ZFS. We have a long laundry list of projects. In addition there are bug fixes and performance changes that customers are demanding. Neil. Mertol Ozyoney wrote: Hi All; Is there any hope for deduplication on ZFS? Mertol *Mertol Ozyoney* Storage Practice - Sales Manager *Sun Microsystems, TR* Istanbul TR Phone +902123352200 Mobile +905339310752 Fax +90212335 Email [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs-discuss Digest, Vol 33, Issue 19
Ross wrote: Hi Gilberto, I bought a Micro Memory card too, so I'm very likely going to end up in the same boat. I saw Neil Perrin's blog about the MM-5425 card, found that Vmetro don't seem to want to sell them, but then last week spotted five of those cards on eBay, so snapped them up. I'm still waiting for the hardware for this server, but regarding the drivers, if these cards don't work out of the box I was planning to pester Neil Perrin and see if he still has some drivers for them :) Unfortunately, there are a couple of problems: 1. It's been a while since I used that board and driver. I recently tried pkgadd-ing it on the latest Nevada build and it hung. I'm not sure if the latest Nevada is somehow incompatible; I didn't have time to track down the cause. 2. I received the board and driver from another group within Sun. It would be better to contact Micro Memory (or whoever took them over) directly, as it's not my place to give out 3rd-party drivers or provide support for them. Sorry for the bad news. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
Peter Cudhea wrote: Your point is well taken that ZFS should not duplicate functionality that is already or should be available at the device driver level. In this case, I think it misses the point of what ZFS should be doing that it is not. ZFS does its own periodic commits to the disk, and it knows if those commit points have reached the disk or not, or whether they are getting errors. In this particular case, those commits to disk are presumably failing, because one of the disks they depend on has been removed from the system. (If the writes are not being marked as failures, that would definitely be an error in the device driver, as you say.) In this case, however, the ZIL log has stopped being updated, but ZFS does nothing to announce that this has happened, or to indicate that a remedy is required. I think you have some misconceptions about how the ZIL works. It doesn't provide journalling like UFS. The following might help: http://blogs.sun.com/perrin/entry/the_lumberjack The ZIL isn't used at all unless there's fsync/O_DSYNC activity. At the very least, it would be extremely helpful if ZFS had a status to report that indicates that the ZIL log is out of date, or that there are troubles writing to the ZIL log, or something like that. If the ZIL cannot be written then we force a transaction group (txg) commit. That is the only recourse to force data to stable storage before returning to the application. An additional feature would be to have user-selectable behavior when the ZIL log is significantly out of date. For example, if the ZIL log is more than X seconds out of date, then new writes to the system should pause, give errors, or continue to silently succeed. Again, this doesn't make sense given how the ZIL works. In an earlier phase of my career, when I worked for a database company, I was responsible for a similar bug.
It caused a major customer to lose a major amount of data when a system rebooted before all good data had been successfully committed to disk. The resulting stink caused us to add a feature to detect the cases when the writing-to-disk process had fallen too far behind, and to pause new writes to the database until the situation was resolved. Peter Bob Friesenhahn wrote: While I do believe that device drivers, or the fault system, should notify ZFS when a device fails (and ZFS should appropriately react), I don't think that ZFS should be responsible for fault monitoring. ZFS is in a rather poor position for device fault monitoring, and if it attempts to do so then it will be slow and may misbehave in other ways. The software which communicates with the device (i.e. the device driver) is in the best position to monitor the device. The primary goal of ZFS is to be able to correctly read data which was successfully committed to disk. There are programming interfaces (e.g. fsync(), msync()) which may be used to ensure that data is committed to disk, and which should return an error if there is a problem. If you were performing your tests over an NFS mount then the results should be considerably different, since NFS requests that its data be committed to disk. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss