Re: [zfs-discuss] Poor read/write performance when using ZFS iSCSI target
initiator_host:~ # dd if=/dev/zero bs=1k of=/dev/dsk/c5t0d0 count=100

So this is going at 3000 x 1K writes per second, or 330 usec per write. The iSCSI target is probably doing an over-the-wire operation for each request, so it looks fine at first glance.

-r

Cody Campbell writes:

 Greetings,

 I want to take advantage of the iSCSI target support in the latest release (svn_91) of OpenSolaris, and I'm running into some performance problems when reading from and writing to my target. I'm including as much detail as I can, so bear with me here...

 I've built an x86 OpenSolaris server (Intel Xeon running NV_91) with a zpool of 15 750GB SATA disks, from which I've created and exported a ZFS volume with the shareiscsi=on property set to generate an iSCSI target.

 My problem is that when I connect to this target from any initiator (tested with both Linux 2.6 and OpenSolaris NV_91, SPARC and x86), the read/write speed is dreadful (~3 megabytes/second!). When I test read/write performance locally against the backing pool, I get excellent speeds. The same can be said when I use services such as NFS and FTP to move files between other hosts on the network and the volume I am exporting as a target; doing that I have achieved the near-gigabit speeds I expect, which has me thinking this isn't a network problem (I've already disabled the Nagle algorithm, if you're wondering). It's not until I add the iSCSI target to the stack that the speeds go south, so I am concerned that I may be missing something in the configuration of the target. Below are some details pertaining to my configuration.

 OpenSolaris iSCSI Target Host:

 target_host:~ # zpool status pool0
   pool: pool0
  state: ONLINE
  scrub: none requested
 config:

         NAME        STATE     READ WRITE CKSUM
         pool0       ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c0t0d0  ONLINE       0     0     0
             c0t1d0  ONLINE       0     0     0
             c0t2d0  ONLINE       0     0     0
             c0t3d0  ONLINE       0     0     0
             c0t4d0  ONLINE       0     0     0
             c0t5d0  ONLINE       0     0     0
             c0t6d0  ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             c0t7d0  ONLINE       0     0     0
             c1t0d0  ONLINE       0     0     0
             c1t1d0  ONLINE       0     0     0
             c1t2d0  ONLINE       0     0     0
             c1t3d0  ONLINE       0     0     0
             c1t4d0  ONLINE       0     0     0
             c1t5d0  ONLINE       0     0     0
         spares
           c1t6d0    AVAIL

 errors: No known data errors

 target_host:~ # zfs get all pool0/vol0
 NAME        PROPERTY        VALUE                  SOURCE
 pool0/vol0  type            volume                 -
 pool0/vol0  creation        Wed Jul  2 18:16 2008  -
 pool0/vol0  used            5T                     -
 pool0/vol0  available       7.92T                  -
 pool0/vol0  referenced      34.2G                  -
 pool0/vol0  compressratio   1.00x                  -
 pool0/vol0  reservation     none                   default
 pool0/vol0  volsize         5T                     -
 pool0/vol0  volblocksize    8K                     -
 pool0/vol0  checksum        on                     default
 pool0/vol0  compression     off                    default
 pool0/vol0  readonly        off                    default
 pool0/vol0  shareiscsi      on                     local
 pool0/vol0  copies          1                      default
 pool0/vol0  refreservation  5T                     local

 target_host:~ # iscsitadm list target -v pool0/vol0
 Target: pool0/vol0
     iSCSI Name: iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
     Alias: pool0/vol0
     Connections: 1
         Initiator:
             iSCSI Name: iqn.1986-03.com.sun:01:0003ba681e7f.486c0829
             Alias: unknown
     ACL list:
     TPGT list:
         TPGT: 1
     LUN information:
         LUN: 0
             GUID: 01304865b1b42a00486c29d2
             VID: SUN
             PID: SOLARIS
             Type: disk
             Size: 5.0T
             Backing store: /dev/zvol/rdsk/pool0/vol0
             Status: online

 OpenSolaris iSCSI Initiator Host:

 initiator_host:~ # iscsiadm list target -vS iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
 Target: iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
         Alias: pool0/vol0
         TPGT: 1
         ISID: 402a
         Connections: 1
         CID: 0
           IP address (Local): 192.168.4.2:63960
           IP address (Peer): 192.168.4.3:3260
         Discovery Method: SendTargets
         Login Parameters (Negotiated):
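One quick, hedged way to confirm that per-request round trips (rather than raw bandwidth) are the limit here would be to repeat the raw-device dd with a much larger block size and compare throughput; the counts below are arbitrary and the device is the one from the post above:

  initiator_host:~ # ptime dd if=/dev/zero bs=8k of=/dev/dsk/c5t0d0 count=131072     # ~1GB in 8K requests
  initiator_host:~ # ptime dd if=/dev/zero bs=1024k of=/dev/dsk/c5t0d0 count=1024    # ~1GB in 1MB requests

If the large-block run is many times faster, the ~330 usec per-operation cost dominates, and batching writes (fewer round trips per request) is where to look.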
Re: [zfs-discuss] Why RAID 5 stops working in 2009
Kyle McDonald writes:

 Ross wrote:
  Just re-read that and it's badly phrased. What I meant to say is that a raid-z / raid-5 array based on 500GB drives seems to have around a 1 in 10 chance of losing some data during a full rebuild.

 Actually, I think it's been explained already why this is one area where RAID-Z will really start to show some of the ways it's different from its RAID-5 ancestors.

 For one, a RAID-5 controller has no idea of the filesystem, and therefore has to rebuild every bit on the disk, whether it's used or not; if it can't, it will declare the whole array unusable. RAID-Z, on the other hand, since it is integrated with the filesystem, only needs to rebuild the *used* data, and won't care if unused parts of the disks can't be rebuilt.

 Second, a factor that the author of that article leaves out is that decent RAID-5, and RAID-Z, can do 'scrubs' of the data at regular intervals, and this will often catch and deal with these read problems well before they have a chance to take all your data with them. The types of errors the author writes about are often caused by how accurately the block was written rather than by a defect in the media, so many times they can be fixed by just rewriting the data to the same block. On ZFS this will almost never happen: because of COW it will always choose a new block to write to. I don't think many (if any) RAID-5 implementations can change the location of data on a drive.

Moreover, ZFS stores redundant copies of metadata, so even if a full raid-z stripe goes south, we can still rebuild most of the pool data. It seems that at worst, such double failures would lead to a handful of unrecovered files.

-r

 -Kyle
Re: [zfs-discuss] zfs_nocacheflush
Peter Tribble writes:

 A question regarding zfs_nocacheflush:

 The Evil Tuning Guide says to only enable this if every device is protected by NVRAM. However, is it safe to enable zfs_nocacheflush when I also have local drives (the internal system drives) using ZFS, in particular if the write cache is disabled on those drives?

 What I have is a local zfs pool on the free space of the internal drives, so I'm only using a partition and the drive's write cache should be off. My theory here is that zfs_nocacheflush shouldn't have any effect because there's no drive cache in use...

Seems plausible, but I'd check that the caches are indeed off using format -e.

-r

 --
 -Peter Tribble
 http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
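For reference, a rough transcript of that check (the menu entries can differ by disk and driver, so treat this as a sketch):

  # format -e
  (select the internal drive from the menu)
  format> cache
  cache> write_cache
  write_cache> display
  Write Cache is disabled
  write_cache> quit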
Re: [zfs-discuss] Periodic flush
Robert Milkowski writes: Hello Roch, Saturday, June 28, 2008, 11:25:17 AM, you wrote: RB I suspect, a single dd is cpu bound. I don't think so. We're nearly so as you show. More below. Se below one with a stripe of 48x disks again. Single dd with 1024k block size and 64GB to write. bash-3.2# zpool iostat 1 capacity operationsbandwidth pool used avail read write read write -- - - - - - - test 333K 21.7T 1 1 147K 147K test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 0 0 0 test 333K 21.7T 0 1.60K 0 204M test 333K 21.7T 0 20.5K 0 2.55G test4.00G 21.7T 0 9.19K 0 1.13G test4.00G 21.7T 0 0 0 0 test4.00G 21.7T 0 1.78K 0 228M test4.00G 21.7T 0 12.5K 0 1.55G test7.99G 21.7T 0 16.2K 0 2.01G test7.99G 21.7T 0 0 0 0 test7.99G 21.7T 0 13.4K 0 1.68G test12.0G 21.7T 0 4.31K 0 530M test12.0G 21.7T 0 0 0 0 test12.0G 21.7T 0 6.91K 0 882M test12.0G 21.7T 0 21.8K 0 2.72G test16.0G 21.7T 0839 0 88.4M test16.0G 21.7T 0 0 0 0 test16.0G 21.7T 0 4.42K 0 565M test16.0G 21.7T 0 18.5K 0 2.31G test20.0G 21.7T 0 8.87K 0 1.10G test20.0G 21.7T 0 0 0 0 test20.0G 21.7T 0 12.2K 0 1.52G test24.0G 21.7T 0 9.28K 0 1.14G test24.0G 21.7T 0 0 0 0 test24.0G 21.7T 0 0 0 0 test24.0G 21.7T 0 0 0 0 test24.0G 21.7T 0 14.5K 0 1.81G test28.0G 21.7T 0 10.1K 63.6K 1.25G test28.0G 21.7T 0 0 0 0 test28.0G 21.7T 0 10.7K 0 1.34G test32.0G 21.7T 0 13.6K 63.2K 1.69G test32.0G 21.7T 0 0 0 0 test32.0G 21.7T 0 0 0 0 test32.0G 21.7T 0 11.1K 0 1.39G test36.0G 21.7T 0 19.9K 0 2.48G test36.0G 21.7T 0 0 0 0 test36.0G 21.7T 0 0 0 0 test36.0G 21.7T 0 17.7K 0 2.21G test40.0G 21.7T 0 5.42K 63.1K 680M test40.0G 21.7T 0 0 0 0 test40.0G 21.7T 0 6.62K 0 844M test44.0G 21.7T 1 19.8K 125K 2.46G test44.0G 21.7T 0 0 0 0 test44.0G 21.7T 0 0 0 0 test44.0G 21.7T 0 18.0K 0 2.24G test47.9G 21.7T 1 13.2K 127K 1.63G test47.9G 21.7T 0 0 0 0 test47.9G 21.7T 0 0 0 0 test47.9G 21.7T 0 15.6K 0 1.94G test47.9G 21.7T 1 16.1K 126K 1.99G test51.9G 21.7T 0 0 0 0 test51.9G 21.7T 0 0 0 0 test51.9G 21.7T 0 14.2K 0 1.77G test55.9G 21.7T 0 14.0K 63.2K 1.73G test55.9G 21.7T 0 0 0 0 test55.9G 21.7T 0 0 0 0 test55.9G 21.7T 0 16.3K 0 2.04G test59.9G 21.7T 0 14.5K 63.2K 1.80G test59.9G 21.7T 0 0 0 0 test59.9G 21.7T 0 0 0 0 test59.9G 21.7T 0 17.7K 0 2.21G test63.9G 21.7T 0 4.84K 62.6K 603M test63.9G 21.7T 0 0 0 0 test63.9G 21.7T 0 0 0 0 test63.9G 21.7T 0 0 0 0 test63.9G 21.7T 0 0 0 0 test63.9G 21.7T 0 0 0 0 test63.9G 21.7T 0 0 0 0 test63.9G 21.7T 0 0 0 0 ^C bash-3.2# bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536 65536+0 records in 65536+0 records out real 1:06.312 user0.074 sys54.060 bash-3.2# Doesn't look like it's CPU bound. So if sys we're at 81% of CPU saturation. If you make this 100% you will still have zeros in the zpool iostat. We
Re: [zfs-discuss] Periodic flush
Bob Friesenhahn writes: On Tue, 15 Apr 2008, Mark Maybee wrote: going to take 12sec to get this data onto the disk. This impedance mis-match is going to manifest as pauses: the application fills the pipe, then waits for the pipe to empty, then starts writing again. Note that this won't be smooth, since we need to complete an entire sync phase before allowing things to progress. So you can end up with IO gaps. This is probably what the original submitter is Yes. With an application which also needs to make best use of available CPU, these I/O gaps cut into available CPU time (by blocking the process) unless the application uses multithreading and an intermediate write queue (more memory) to separate the CPU-centric parts from the I/O-centric parts. While the single-threaded application is waiting for data to be written, it is not able to read and process more data. Since reads take time to complete, being blocked on write stops new reads from being started so the data is ready when it is needed. There is one down side to this new model: if a write load is very bursty, e.g., a large 5GB write followed by 30secs of idle, the new code may be less efficient than the old. In the old code, all This is also a common scenario. :-) Presumably the special slow I/O code would not kick in unless the burst was large enough to fill quite a bit of the ARC. Bursts of 1/8th of physical memory or 5 seconds of storage throughput whichever is smallest. -r Real time throttling is quite a challenge to do in software. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance with Sun StorageTek 2540
Bob Friesenhahn writes: On Fri, 15 Feb 2008, Roch Bourbonnais wrote: What was the interlace on the LUN ? The question was about LUN interlace not interface. 128K to 1M works better. The segment size is set to 128K. The max the 2540 allows is 512K. Unfortunately, the StorageTek 2540 and CAM documentation does not really define what segment size means. Any compression ? Compression is disabled. Does turn off checksum helps the number (that would point to a CPU limited throughput). I have not tried that but this system is loafing during the benchmark. It has four 3GHz Opteron cores. Does this output from 'iostat -xnz 20' help to understand issues? extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 3.00.7 26.43.5 0.0 0.00.04.2 0 2 c1t1d0 0.0 154.20.0 19680.3 0.0 20.70.0 134.2 0 59 c4t600A0B80003A8A0B096147B451BEd0 0.0 211.50.0 26940.5 1.1 33.95.0 160.5 99 100 c4t600A0B800039C9B50A9C47B4522Dd0 0.0 211.50.0 26940.6 1.1 33.95.0 160.4 99 100 c4t600A0B800039C9B50AA047B4529Bd0 0.0 154.00.0 19654.7 0.0 20.70.0 134.2 0 59 c4t600A0B80003A8A0B096647B453CEd0 0.0 211.30.0 26915.0 1.1 33.95.0 160.5 99 100 c4t600A0B800039C9B50AA447B4544Fd0 0.0 152.40.0 19447.0 0.0 20.50.0 134.5 0 59 c4t600A0B80003A8A0B096A47B4559Ed0 0.0 213.20.0 27183.8 0.9 34.14.2 159.9 90 100 c4t600A0B800039C9B50AA847B45605d0 0.0 152.50.0 19453.4 0.0 20.50.0 134.5 0 59 c4t600A0B80003A8A0B096E47B456DAd0 0.0 213.20.0 27177.4 0.9 34.14.2 159.9 90 100 c4t600A0B800039C9B50AAC47B45739d0 0.0 213.20.0 27195.3 0.9 34.14.2 159.9 90 100 c4t600A0B800039C9B50AB047B457ADd0 0.0 154.40.0 19711.8 0.0 20.70.0 134.0 0 59 c4t600A0B80003A8A0B097347B457D4d0 0.0 211.30.0 26958.6 1.1 33.95.0 160.6 99 100 c4t600A0B800039C9B50AB447B4595Fd0 Interesting that a subset of 5 disks are responding faster (which also leads to smaller actv queues and so lower service times) than the 7 others. and the slow ones are subject to more writes...haha. If the sizes of the luns are different (or have different amount of free blocks) then maybe ZFS is now trying to rebalance free space by targetting a subset of the disks with more new data. Pool throughput will be impacted by this. -r Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
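If imbalance between the LUNs is the suspicion, a simple check (hedged, since the exact columns vary by release) is the per-vdev capacity view:

  # zpool iostat -v <pool> 20

Compare the capacity used/avail figures of the individual c4t600A0B... devices; markedly different free space on a subset of LUNs would be consistent with ZFS steering new writes toward the emptier ones.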
Re: [zfs-discuss] UFS on zvol Cache Questions...
Priming the cache for ZFS should work, at least right after boot when freemem is large; any read block will make it into the cache. Post boot, when memory is already primed with something else (what?), it gets more difficult for both UFS and ZFS to guess what to keep in their caches. Did you try priming ZFS after boot?

Next, you seem to suffer because your sequential writes to log files appear to displace the more useful DB files from the ARC (I'd be interested to see if this still occurs after you've primed the ZFS cache after boot). Note that if your logfile rate is huge (dd-like) then ZFS cache management will suffer, but that is well on its way to being fixed. For DS, though, I would think that the log rate would be more reasonable and that your storage is able to keep up. That gives ZFS cache management a fighting chance to keep the reused data in preference to the sequential writes.

If the default behavior is not working for you, we'll need to consider the ARC behavior in this case. I don't see why it should not work out of the box. But manual control will also come in the form of this DIO-like feature:

        6429855 Need way to tell ZFS that caching is a lost cause

While we try to solve your problem out of the box, you might also run a background process that keeps priming the cache at a low I/O rate. Not a great workaround, but it should be effective.

-r

Brad Diggs writes:

 Hello Darren,

 Please find responses in line below...

 On Fri, 2008-02-08 at 10:52 +, Darren J Moffat wrote:
  Brad Diggs wrote:
   I would like to use ZFS but with ZFS I cannot prime the cache and I don't have the ability to control what is in the cache (e.g. like with the directio UFS option).

  Why do you believe you need that at all ?

 My application is directory server. The #1 resource that directory needs to make maximum utilization of is RAM. In order to do that, I want to control every aspect of RAM utilization, both to safely use as much RAM as possible AND to avoid contention among things trying to use RAM.

 Let's consider the following example. A customer has a 50M entry directory. The sum of the data (db3 files) is approximately 60GB. However, there is another 2GB for the root filesystem, 30GB for the changelog, 1GB for the transaction logs, and 10GB for the informational logs. The system on which directory server will run has only 64GB of RAM. The system is configured with the following partitions:

   FS      Used(GB)  Description
   /       2         root
   /db     60        directory data
   /logs   41        changelog, txn logs, and info logs
   swap    10        system swap

 I prefer to keep the directory db cache and entry caches relatively small. So the db cache is 2GB and the entry cache is 100M. This leaves roughly 63GB of RAM for my 60GB of directory data and Solaris. The only way to ensure that the directory data (/db) is the only thing in the filesystem cache is to set directio on / (root) and /logs.

  What do you do to prime the cache with UFS

   cd ds_instance_dir/db
   for i in `find . -name '*.db3'`
   do
     dd if=${i} of=/dev/null
   done

  and what benefit do you think it is giving you ?

 Priming the directory server data into the filesystem cache reduces ldap response time for directory data in the filesystem cache. This could mean the difference between a sub-ms response time and a response time on the order of tens or hundreds of ms, depending on the underlying storage speed. For telcos in particular, minimal response time is paramount.

 Another common scenario is when we do benchmark bakeoffs with another vendor's product.
 If the data isn't pre-primed, then ldap response time and throughput will be artificially degraded until the data is primed into either the filesystem or directory (db or entry) cache. Priming via ldap operations can take many hours or even days depending on the number of entries in the directory server. However, priming the same data via dd takes minutes to hours depending on the size of the files. As you know, in benchmarking scenarios time is the most limited resource that we typically have. Thus, priming via dd is much preferred.

 Lastly, in order to achieve optimal use of available RAM, we use directio for the root (/) and other non-data filesystems. This makes certain that the only data in the filesystem cache is the directory data.

  Have you tried just using ZFS and found it doesn't perform as you need or are you assuming it won't because it doesn't have directio ?

 We have done extensive testing with ZFS and love it. The three areas lacking for our use cases are as follows:

  * No ability to control what is in cache, e.g. no directio
  * No absolute ability to apply an upper boundary to the amount of RAM consumed by ZFS. I know that the arc cache has a control that
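As a sketch of the "background process that keeps priming the cache at a low I/O rate" workaround suggested above (the paths, block size, and sleep interval are made up; adjust them to the instance layout):

  #!/bin/sh
  # Re-read the db3 files slowly, over and over, so they stay warm in the ARC.
  while true; do
      for f in `find /db -name '*.db3'`; do
          dd if="$f" of=/dev/null bs=128k 2>/dev/null
          sleep 60     # throttle so the priming I/O stays in the background
      done
  done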
Re: [zfs-discuss] simulating directio on zfs?
Andrew Robb writes:

 The big problem that I have with non-directio is that buffering delays program execution. When reading/writing files that are many times larger than RAM without directio, it is very apparent that system response drops through the floor - it can take several minutes for an ssh login to prompt for a password. This is true both for UFS and ZFS.

Apart from the ZFS write case, I find this a bit surprising. Are you sure this is a general statement, or would details of your configuration be of interest here? As for the ZFS write case, this problem is well on its way to being fixed:

        6429205 each zpool needs to monitor it's throughput and throttle heavy writers

Now that you say this, I think I can see how it would be possible in the UFS write case also. Both read cases I find troublesome though. See below on disabling speculative reads. For ZFS, I guess some relief might come from implementing something like this:

        6429855 Need way to tell ZFS that caching is a lost cause

 Repeat the exercise with directio on UFS and there is no discernible delay in starting new applications (ssh login takes a second or so to prompt for a password). Writing a large file might appear to take a few seconds longer with directio, but add a sync command to the end and it is apparent that there is no real delay in getting the data to disc with directio.

 I'd like to see directio() provide some of the facilities under ZFS that it affords under UFS:
  1. data is not buffered beyond the I/O request
  2. no speculative reads
  3. synchronous writes of whole records
  4. concurrent I/O (which is already available in ZFS)

So we should not confuse UFS directio with synchronous semantics; I think point 3 comes from a confusion.

For point 2, we've not too long ago fixed one level of speculative reads (vdev), which should not cause problems anymore. The other level (zfetch) needs attention. I see no reason that good software can't work out of the box for you. In the meantime it is possible to disable speculative reads as described here:

 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#File-Level_Prefetching
 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device-Level_Prefetching

For point 1, I think the current behavior of ZFS is far from good. Once the problems are fixed, though, it will be time to reevaluate whether the buffering strategies are causing problems and under what conditions. Your memory BW issues could be one of them.

 Note: I consider memory bandwidth a finite and precious resource - not to be wasted in double-buffering. (I have a naive test program that is completely bound by main memory bandwidth - two programs on two CPUs run at half the speed of a single program on one CPU.)

But also consider your number of disks and system bus bandwidth in this equation. Many workloads won't be hit by the double memory BW if they are spindle-limited.

A lot of the longing for directio comes from a few serious quirks in the current implementation. You've had legitimate issues in UFS and ZFS; the UFS issues happened to be fixed by UFS/DIO, and I think many of the ZFS ones can be fixed with something not called directio.

-r
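For reference, the file-level prefetch knob from the first link above is an /etc/system tunable (an unstable interface - verify the name against your build before relying on it); the device-level knob is covered at the second link:

  * /etc/system: disable ZFS file-level (zfetch) prefetching
  set zfs:zfs_prefetch_disable = 1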
Re: [zfs-discuss] ZIL controls in Solaris 10 U4?
Jonathan Loran writes:

 Is it true that Solaris 10 u4 does not have any of the nice ZIL controls that exist in the various recent Open Solaris flavors? I would like to move my ZIL to solid state storage, but I fear I can't do it until I have another update. Heck, I would be happy to just be able to turn the ZIL off to see how my NFS on ZFS performance is affected before spending the $'s. Anyone know when we will see this in Solaris 10?

You can certainly turn it off with any release (Jim's link). It's true that S10u4 does not have the separate intent log needed to use an SSD for ZIL blocks. I believe S10U5 will have that feature.

As noted, disabling the ZIL won't lead to ZFS pool corruption, just DB corruption (that includes NFS clients). To protect against that, in the event of a server crash with zil_disable=1 you'd need to reboot all NFS clients of the server (to clear the clients' caches), and better do this before the server comes back up (kind of a raw proposition here).

-r

 Thanks,

 Jon

 --
 Jonathan Loran
 IT Manager, Space Sciences Laboratory, UC Berkeley
 (510) 643-5146  [EMAIL PROTECTED]
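The usual way to turn it off for a test (per the Evil Tuning Guide; the tunable name is unstable, so verify it on your build) is either an /etc/system entry plus reboot, or a live poke with mdb followed by a remount of the filesystems being tested:

  * /etc/system: disable the ZIL (test only - synchronous commits are no longer honored)
  set zfs:zil_disable = 1

  # or, live:
  echo zil_disable/W0t1 | mdb -kw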
Re: [zfs-discuss] ZFS vq_max_pending value ?
Manoj Nayak writes:

 Hi All.

 The ZFS documentation says ZFS schedules its I/O in such a way that it manages to saturate a single disk's bandwidth using enough concurrent 128K I/Os. The number of concurrent I/Os is decided by vq_max_pending; the default value for vq_max_pending is 35.

 We have created a 4-disk raid-z group inside a ZFS pool on a Thumper. The ZFS record size is set to 128K. When we read/write a 128K record, it issues a 128K/3 I/O to each of the 3 data disks in the 4-disk raid-z group.

 We need to saturate all three data disks' bandwidth in the raid-z group. Is it required to set the vq_max_pending value to 35*3=135 ?

Nope. Once a disk controller is working on 35 requests, we don't expect to get any more out of it by queueing more requests, and we might even confuse the firmware and get less.

Now, for an array controller and a vdev fronting a large number of disks, 35 might be a low number that doesn't allow full throughput. Rather than tuning 35 up, we suggest splitting such devices into smaller LUNs, since each LUN is given a 35-deep queue.

Tuning vq_max_pending down helps read and synchronous write (ZIL) latency. Today the preferred way to help ZIL latency is to use a separate intent log.

-r

 Thanks

 Manoj Nayak
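If you do end up experimenting with the queue depth, the knob on recent builds is the global zfs_vdev_max_pending (name per the Evil Tuning Guide - an unstable tunable, so check that it exists on your kernel):

  * /etc/system: lower the per-vdev queue depth from the default of 35
  set zfs:zfs_vdev_max_pending = 10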
Re: [zfs-discuss] ZFS vdev_cache
Manoj Nayak writes:

 Hi All,

 Is any dtrace script available to figure out the vdev_cache (or software track buffer) reads in kilobytes? The documentation says the default size of the read is 128K, however the vdev_cache source code says the default size is 64K.

 Thanks

 Manoj Nayak

Which document? It's 64K when it applies. Nevada won't use the vdev_cache for data blocks anymore.

-r
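Two hedged ways to look at this: on builds that have it, the vdev_cache kstat gives hit/miss counts, and an fbt one-liner can sum the sizes of reads passing through the vdev-cache layer. The dtrace below assumes the vdev_cache_read() entry probe and the zio_t layout on your build - a starting point, not a supported interface:

  # kstat -n vdev_cache_stats

  # dtrace -qn 'fbt::vdev_cache_read:entry { @kb = sum(args[0]->io_size / 1024); }
      tick-10s { printa("KB entering vdev_cache_read (last 10s): %@d\n", @kb); clear(@kb); }'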
Re: [zfs-discuss] ZFS recordsize
Why do you want greater-than-128K records? Check out:

 http://blogs.sun.com/roch/entry/128k_suffice

-r

Manoj Nayak writes:

 Hi All,

 Is it not possible to increase the zfs record size beyond 128k? I am using Solaris 10 Update 4. I get the following error when I try to set the zfs record size to 1024k:

  zfs set recordsize=1024k md9/test
  cannot set property for 'md9/test': 'recordsize' must be power of 2 from 512 to 128k

 Thanks

 Manoj Nayak
Re: [zfs-discuss] ZFS recordsize
Manoj Nayak writes: Roch - PAE wrote: Why do you want greater than 128K records. A single-parity RAID-Z pool on thumper is created it consists of four disk.Solaris 10 update 4 runs on thumper.Then zfs filesystem is created in the pool.1 mb data is written to a file in filesystem using write (2) system call.However dtrace displays too many small sized physical disk read when the same 1 mb data is read using read() system call.recordsize is set to 128k. What's going on here ? why so many small size block read is done ? Each record becomes it's own raid-z stripe. When you read a 128K record it will need to issue a 128K/3 I/O to each of the 3 data disks in the 4-disk raid-z group. But with prefetching it's possible that the I/O scheduler aggregate this to a higher value. It seems this is not happening on your setup. How large are the I/Os, if smaller than the above, is the pool rather filled up ? More reading : http://blogs.sun.com/roch/entry/when_to_and_not_to (raidz) -r if 0LARGE option need to be mentioned at the time of creating this file to tell , so that zfs use bigger block size. I think, zfs does use block size as per the following statements. ZFS files smaller than the recordsize are stored using a single filesystem block (FSB) of variable length in multiple of a disk sector (512 Bytes). Larger files are stored using multiple FSB, each of recordsize bytes, with default value of 128K. dtrace output : Event Device Device PathName RW Block Size Block No Offset Path sc-read. R 1052672 0 /mnt/bank0/media/CAD1/4.1 fop_read . R 1052672 0 /mnt/bank0/media/CAD1/4.1 disk_io sd6 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R65536 50816 0 none disk_io sd21 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R65536 50816 0 none disk_io sd48 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R65536 50816 0 none disk_io sd48 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R 131072 47839 0 none disk_io sd48 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R87552 48095 0 none disk_io sd48 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R43520 48352 0 none disk_io sd48 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R43520 48523 0 none disk_io sd48 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R87552 48950 0 none disk_io sd48 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R87552 49121 0 none disk_io sd6 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R 131072 48096 0 none disk_io sd6 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R87552 48352 0 none disk_io sd21 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R43520 48267 0 none disk_io sd21 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R43520 48438 0 none disk_io sd6 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R87552 48523 0 none disk_io sd6 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL 
PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R43520 48951 0 none disk_io sd21 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R87552 49891 0 none disk_io sd21 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R44032 50062 0 none disk_io sd6 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL PROTECTED]/[EMAIL PROTECTED],0:a R87552 49378 0 none disk_io sd13 /devices/[EMAIL PROTECTED],0/pci1022,[EMAIL PROTECTED]/pci11ab,[EMAIL
Re: [zfs-discuss] JBOD performance
Frank Penczek writes:

 Hi,

 On Dec 17, 2007 4:18 PM, Roch - PAE [EMAIL PROTECTED] wrote:
   The pool holds home directories so small sequential writes to one large file present one of a few interesting use cases.

  Can you be more specific here ? Do you have a body of applications that would do small sequential writes, or one in particular ? Another interesting piece of info is whether we expect those to be allocating writes or overwrites (beware that some apps move the old file out, then run allocating writes, then unlink the original file).

 Sorry, I'll try to be more specific. The zpool contains home directories that are exported to client machines. It is hard to predict what exactly users are doing, but one thing users do for certain is check out software projects from our subversion server. The projects typically contain many source code files (thousands) and a build process accesses all of them in the worst case. That is what I meant by "many (small) files like compiling projects" in my previous post. The performance for this case is ... hopefully improvable.

This we'll have to work on. But first, if this is to storage with NVRAM, I assume you checked that the storage does not flush its caches:

 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes

If that is not your problem and ZFS underperforms another FS on the backend of NFS, then this needs investigation. If ZFS/NFS underperforms a direct-attach FS, that might just be an NFS issue not related to ZFS. Again, that needs investigation. Performance gains won't happen unless we find out what doesn't work.

 Now for sequential writes: we don't have a specific application issuing sequential writes, but I can think of at least a few cases where these writes may occur, e.g. dumps of substantial amounts of measurement data or growing log files of applications. In either case these would be mainly allocating writes.

Right, but I'd hope the application would issue substantially large writes, especially if it needs to dump data at a high rate. If the data rate is more modest, then the CPU lost to this effect will itself be modest.

 Does this provide the information you're interested in?

I get a sense that it's more important we find out what your build issue is. But the small writes will have to be improved one day too.

-r

 Cheers,
 Frank
Re: [zfs-discuss] JBOD performance
dd uses a default block size of 512B. Does this map to your expected usage ?

When I quickly tested the CPU cost of small reads from cache, I did see that ZFS was more costly than UFS up to a crossover between 8K and 16K. We might need a more comprehensive study of that (data in/out of cache, different recordsize alignment constraints). But for small syscalls, I think we may need some work in ZFS to make it CPU efficient.

So first, does a small sequential write to a large file match an interesting use case ?

-r
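A hedged way to separate the per-syscall cost from the I/O path is to repeat the same run with dd's default 512B blocks and again with a recordsize-matched block, same total amount of data (the file path is hypothetical):

  $ ptime dd if=/dev/zero of=/pool/home/bigfile bs=512 count=2097152    # ~1GB in 512B writes
  $ ptime dd if=/dev/zero of=/pool/home/bigfile bs=128k count=8192      # ~1GB in 128K writes

A large gap in sys time between the two points at per-call overhead rather than the disks.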
Re: [zfs-discuss] JBOD performance
Frank Penczek writes: Hi, On Dec 17, 2007 10:37 AM, Roch - PAE [EMAIL PROTECTED] wrote: dd uses a default block size of 512B. Does this map to your expected usage ? When I quickly tested the CPU cost of small read from cache, I did see that ZFS was more costly than UFS up to a crossover between 8K and 16K. We might need a more comprehensive study of that (data in/out of cache, different recordsize alignment constraints ). But for small syscalls, I think we might need some work in ZFS to make it CPU efficient. So first, does small sequential writeto a large file, matches an interesting use case ? The pool holds home directories so small sequential writes to one large file present one of a few interesting use cases. Can you be more specific here ? Do you have a body of application that would do small sequential writes; or one in particular ? Another interesting info is if we expect those to be allocating writes or overwrite (beware that some app, move the old file out, then run allocating writes, then unlink the original file). The performance is equally disappointing for many (small) files like compiling projects in svn repositories. ??? -r Cheers, Frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Odd prioritisation issues.
Dickon Hood writes:

 On Fri, Dec 07, 2007 at 13:14:56 +, I wrote:
 : On Fri, Dec 07, 2007 at 12:58:17 +, Darren J Moffat wrote:
 : : Dickon Hood wrote:
 : : On Fri, Dec 07, 2007 at 12:38:11 +, Darren J Moffat wrote:
 : : : Dickon Hood wrote:
 : : : We're seeing the writes stall in favour of the reads. For normal
 : : : workloads I can understand the reasons, but I was under the impression
 : : : that real-time processes essentially trump all others, and I'm surprised
 : : : by this behaviour; I had a dozen or so RT-processes sat waiting for disc
 : : : for about 20s.
 : : : Are the files opened with O_DSYNC or does the application call fsync ?
 : : No. O_WRONLY|O_CREAT|O_LARGEFILE|O_APPEND. Would that help?
 : : Don't know if it will help, but it will be different :-). I suspected
 : : that since you put the processes in the RT class you would also be doing
 : : synchronous writes.
 : Right. I'll let you know on Monday; I'll need to restart it in the
 : morning.

 I was a tad busy yesterday and didn't have the time, but I've switched one of our recorder processes (the one doing the HD stream; ~17Mb/s, broadcasting a preview we don't mind trashing) to a version of the code which opens its file O_DSYNC as suggested.

 We've gone from ~130 write ops per second and 10MB/s to ~450 write ops per second and 27MB/s, with a marginally higher CPU usage. This is roughly what I'd expect.

 We've artificially throttled the reads, which has helped (but not fixed; it isn't as determinative as we'd like) the starvation problem, at the expense of increasing a latency we'd rather have as close to zero as possible.

 Any ideas?

O_DSYNC was a good idea. Then, if you have recent Nevada, you can use the separate intent log (the log keyword in zpool create) to absorb those writes without having spindle competition with the reads. Your write workload should then be well handled (unless the incoming network processing is itself delayed).

-r

 Thanks.

 --
 Dickon Hood

 Due to digital rights management, my .sig is temporarily unavailable.
 Normal service will be resumed as soon as possible. We apologise for the
 inconvenience in the meantime.

 No virus was found in this outgoing message as I didn't bother looking.
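For reference, a hedged sketch of attaching a separate intent log (this needs a build with slog support; pool and device names are hypothetical, and an NVRAM-backed or solid-state device is what you'd want here):

  # at pool creation:
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 log c3t0d0
  # or added to an existing pool:
  zpool add tank log c3t0d0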
Re: [zfs-discuss] Lack of physical memory evidences
Dmitry Degrave writes: In pre-ZFS era, we had observable parameters like scan rate and anonymous page-in/-out counters to discover situations when a system experiences a lack of physical memory. With ZFS, it's difficult to use mentioned parameters to figure out situations like that. Has someone any idea what we can use for the same purpose now ? Those should still work. What prevents them from being effective markers today is that no matter how much memory you have, a write heavy workload (one that dirties data faster than disk drain) will consume whatever you have and trigger the markers. If you don't have the above problem, then anonymous paging is a good sign of memory shortage. -r Thanks in advance, Dmeetry This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
Moore, Joe writes:

 Louwtjie Burger wrote:
  Richard Elling wrote:
   - COW probably makes that conflict worse

   This needs to be proven with a reproducible, real-world workload before it makes sense to try to solve it. After all, if we cannot measure where we are, how can we prove that we've improved?

  I agree, let's first find a reproducible example where updates negatively impact large table scans ... one that is rather simple (if there is one) to reproduce, and then work from there.

 I'd say it would be possible to define a reproducible workload that demonstrates this using the Filebench tool... I haven't worked with it much (maybe over the holidays I'll be able to do this), but I think a workload like:

  1) create a large file (bigger than main memory) on an empty ZFS pool.
  2) time a sequential scan of the file
  3) random write i/o over say, 50% of the file (either with or without matching blocksize)
  4) time a sequential scan of the file

 The difference between times 2 and 4 is the penalty that COW block reordering (which may introduce seemingly-random seeks between sequential blocks) imposes on the system.

But it's not the only thing. The difference between 2 and 4 is the COW penalty that one can hide under prefetching and many spindles. The other thing is to see what the impact (throughput and response time) of the file scan operation is on the ongoing random write load. Third is the impact on the CPU cycles required to do the file scans.

-r

 It would be interesting to watch seeksize.d's output during this run too.

 --Joe
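A rough sketch of those four steps as plain shell commands (not Filebench; /test is a hypothetical empty pool, and the file size should exceed RAM):

  # 1) create a large file
  dd if=/dev/zero of=/test/bigfile bs=128k count=262144        # ~32GB
  # 2) time a sequential scan
  ptime dd if=/test/bigfile of=/dev/null bs=128k
  # 3) random overwrites of ~50% of the file -- a real test would use a
  #    random-I/O tool (e.g. Filebench's random-write personality, or a small
  #    program doing pwrite() at random 128K-aligned offsets); illustrative only
  # 4) re-time the sequential scan
  ptime dd if=/test/bigfile of=/dev/null bs=128k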
Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)
Neil Perrin writes: Joe Little wrote: On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote: Joe, I don't think adding a slog helped in this case. In fact I believe it made performance worse. Previously the ZIL would be spread out over all devices but now all synchronous traffic is directed at one device (and everything is synchronous in NFS). Mind you 15MB/s seems a bit on the slow side - especially is cache flushing is disabled. It would be interesting to see what all the threads are waiting on. I think the problem maybe that everything is backed up waiting to start a transaction because the txg train is slow due to NFS requiring the ZIL to push everything synchronously. I agree completely. The log (even though slow) was an attempt to isolate writes away from the pool. I guess the question is how to provide for async access for NFS. We may have 16, 32 or whatever threads, but if a single writer keeps the ZIL pegged and prohibiting reads, its all for nought. Is there anyway to tune/configure the ZFS/NFS combination to balance reads/writes to not starve one for the other. Its either feast or famine or so tests have shown. No there's no way currently to give reads preference over writes. All transactions get equal priority to enter a transaction group. Three txgs can be outstanding as we use a 3 phase commit model: open; quiescing; and syncing. That makes me wonder if this is not just the lack of write throttling issue. If one txg is syncing and the other is quiesced out, I think it means we have let in too many writes. We do need a better balance. Neil is it correct that reads never hit txg_wait_open(), but they just need an I/O scheduler slot ? If so seems to me just a matter of 6429205 each zpool needs to monitor it's throughput and throttle heavy writers However, if this is it, disabling the zil would not solve the issue (it might even make it worse). So I am lost as to what could be blocking the reads other than lack of I/O slots. As another way to improve I/O scheduler we have : 6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops -r Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + DB + fragments
Louwtjie Burger writes:

 Hi

 After a clean database load a database would (should?) look like this, if a random stab at the data is taken...

  [8KB-m][8KB-n][8KB-o][8KB-p]...

 The data should be fairly (100%) sequential in layout ... after some days though that same spot (using ZFS) would probably look like:

  [8KB-m][     ][8KB-o][     ]

 Is this pseudo logical-physical view correct (if blocks n and p were updated and, with COW, relocated somewhere else)?

That's the proper view if the ZFS recordsize is tuned to be 8KB. That's a best practice that might need to be qualified in the future.

 Could a utility be constructed to show the level of fragmentation ? (50% in above example)

That would need to dive into the internals of ZFS. But anything is possible; it's been done for UFS before.

 IF the above theory is flawed... how would fragmentation look/be observed/calculated under ZFS with large Oracle tablespaces? Does it even matter what the fragmentation is from a performance perspective?

It matters to table scans and how those scans will impact OLTP workloads. Good blog topic. Stay tuned.
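For reference, the "recordsize tuned to 8KB" practice amounts to something like the following, done before the data files are created (the dataset name is hypothetical; recordsize only affects files written after it is set):

  # zfs create pool/oradata
  # zfs set recordsize=8k pool/oradata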
Re: [zfs-discuss] ZFS + DB + default blocksize
Louwtjie Burger writes:

 Hi

 What is the impact of not aligning the DB blocksize (16K) with ZFS, especially when it comes to random reads on a single HW RAID LUN? How would one go about measuring the impact (if any) on the workload?

The DB will have a bigger in-memory footprint, as you will need to keep the ZFS record around for the lifespan of the DB block. This probably means you want to partition memory between the DB cache and the ZFS ARC according to the ratio of DB blocksize to ZFS recordsize.

Then I imagine you have multiple spindles associated with the LUN. If your LUN is capable of 2000 IOPS over a 200MB/sec data channel, then during 1 second at full speed: 2000 IOPS * 16K = 32MB of data transfer, which fits within the channel capability. But using, say, ZFS blocks of 128K: 2000 IOPS * 128K = 256MB, which overloads the channel. So in this example the data channel would saturate first, preventing you from reaching those 2000 IOPS.

But with enough memory and data channel throughput, it's a good idea to keep the ZFS recordsize large.

-r

 Thank you
Re: [zfs-discuss] Unreasonably high sys utilization during file create operations.
Was that with compression enabled ? Got zpool status output ? -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] [Fwd: [Fwd: MySQL benchmark]]
Original Message
Subject: [zfs-discuss] MySQL benchmark
Date: Tue, 30 Oct 2007 00:32:43 +
From: Robert Milkowski [EMAIL PROTECTED]
Reply-To: Robert Milkowski [EMAIL PROTECTED]
Organization: CI TASK http://www.task.gda.pl
To: zfs-discuss@opensolaris.org

Hello zfs-discuss,

 http://dev.mysql.com/tech-resources/articles/mysql-zfs.html

I've just quickly glanced thru it. However, the argument about a double-buffering problem is not valid.

--
Best regards,
Robert Milkowski
mailto:[EMAIL PROTECTED]
http://milek.blogspot.com

--- end of forwarded message ---

I absolutely agree with Robert here. Data is cached once in the database and, absent directio, _some_ extra memory is required to stage the I/Os.

On reads it's a tiny amount, since memory can be reclaimed as soon as it's copied to user space: 10,000 threads each waiting for 8K would be serviced using 80M of extra memory.

On the write path we need to stage the data for the purpose of a ZFS transaction group. When the dust settles we will be able to do this every 5 seconds. So, what percentage of DB blocks are modified in 5 to 10 seconds? If the answer is 5% then yes, the lack of directio is a concern for you.

-r
Re: [zfs-discuss] ZIL reliability/replication questions
 This should work. It shouldn't even lose the in-flight transactions. ZFS reverts to using the main pool if a slog write fails or the slog fills up.

 So, the only way to lose transactions would be a crash or power loss, leaving outstanding transactions in the log, followed by the log device failing to start up on reboot? I assume that would be handled relatively cleanly (files have out-of-date data), as opposed to something nasty like the pool failing to start up.

From the zpool's perspective it's just data loss. However, it is loss of data the application had committed, so applications that relied on committing data for their own consistency might end up with a corrupted view of the world. NFS clients fall in this bin.

Mirroring the NVRAM cards in the separate intent log seems like a very good idea.

-r
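A hedged sketch of that (device names hypothetical, on a build with slog support):

  # zpool add tank log mirror c4t0d0 c4t1d0

so that a single failed NVRAM card doesn't take the only copy of recently committed, not-yet-applied data with it.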
Re: [zfs-discuss] Sequential reading/writting from large stripe faster on SVM than ZFS?
I would suspect the checksum part of this (I do believe it's being actively worked on) : 6533726 single-threaded checksum raidz2 parity calculations limit write bandwidth on thumper -r Robert Milkowski writes: Hi, snv_74, x4500, 48x 500GB, 16GB RAM, 2x dual core # zpool create test c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c5t1d0 c5t2d0 c5t3d0 c5t5d0 c5t6d0 c5t7d0 c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0 c6t5d0 c6t6d0 c6t7d0 c7t0d0 c7t1d0 c7t2d0 c7t3d0 c7t4d0 c7t5d0 c7t6d0 c7t7d0 [46x 500GB] # ls -lh /test/q1 -rw-r--r-- 1 root root 82G Oct 18 09:43 /test/q1 # dd if=/test/q1 of=/dev/null bs=16384k # zpool iostat 1 capacity operationsbandwidth pool used avail read write read write -- - - - - - - test 213G 20.6T645120 80.1M 14.7M test 213G 20.6T 9.26K 0 1.16G 0 test 213G 20.6T 9.66K 0 1.21G 0 test 213G 20.6T 9.41K 0 1.18G 0 test 213G 20.6T 9.41K 0 1.18G 0 test 213G 20.6T 7.45K 0 953M 0 test 213G 20.6T 7.59K 0 971M 0 test 213G 20.6T 7.41K 0 948M 0 test 213G 20.6T 8.25K 0 1.03G 0 test 213G 20.6T 9.17K 0 1.15G 0 test 213G 20.6T 9.54K 0 1.19G 0 test 213G 20.6T 9.89K 0 1.24G 0 test 213G 20.6T 9.41K 0 1.18G 0 test 213G 20.6T 9.31K 0 1.16G 0 test 213G 20.6T 9.80K 0 1.22G 0 test 213G 20.6T 8.72K 0 1.09G 0 test 213G 20.6T 7.86K 0 1006M 0 test 213G 20.6T 7.21K 0 923M 0 test 213G 20.6T 7.62K 0 975M 0 test 213G 20.6T 8.68K 0 1.08G 0 test 213G 20.6T 9.81K 0 1.23G 0 test 213G 20.6T 9.57K 0 1.20G 0 So it's around 1GB/s. # dd if=/dev/zero of=/test/q10 bs=128k # zpool iostat 1 capacity operationsbandwidth pool used avail read write read write -- - - - - - - test 223G 20.6T656170 81.5M 20.8M test 223G 20.6T 0 8.10K 0 1021M test 223G 20.6T 0 7.94K 0 1001M test 216G 20.6T 0 6.53K 0 812M test 216G 20.6T 0 7.19K 0 906M test 216G 20.6T 0 6.78K 0 854M test 216G 20.6T 0 7.88K 0 993M test 216G 20.6T 0 10.3K 0 1.27G test 222G 20.6T 0 8.61K 0 1.04G test 222G 20.6T 0 7.30K 0 919M test 222G 20.6T 0 8.16K 0 1.00G test 222G 20.6T 0 8.82K 0 1.09G test 225G 20.6T 0 4.19K 0 511M test 225G 20.6T 0 10.2K 0 1.26G test 225G 20.6T 0 9.15K 0 1.13G test 225G 20.6T 0 8.46K 0 1.04G test 225G 20.6T 0 8.48K 0 1.04G test 225G 20.6T 0 10.9K 0 1.33G test 231G 20.6T 0 3 0 3.96K test 231G 20.6T 0 0 0 0 test 231G 20.6T 0 0 0 0 test 231G 20.6T 0 9.02K 0 1.11G test 231G 20.6T 0 12.2K 0 1.50G test 231G 20.6T 0 9.14K 0 1.13G test 231G 20.6T 0 10.3K 0 1.27G test 231G 20.6T 0 9.08K 0 1.10G test 237G 20.6T 0 0 0 0 test 237G 20.6T 0 0 0 0 test 237G 20.6T 0 6.03K 0 760M test 237G 20.6T 0 9.18K 0 1.13G test 237G 20.6T 0 8.40K 0 1.03G test 237G 20.6T 0 8.45K 0 1.04G test 237G 20.6T 0 11.1K 0 1.36G Well, writing could be faster than reading here... there're gaps due to bug 6415647 I guess. # zpool destroy test # metainit d100 1 46 c0t0d0s0 c0t1d0s0 c0t2d0s0 c0t3d0s0 c0t4d0s0 c0t5d0s0 c0t6d0s0 c0t7d0s0 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t5d0s0 c1t6d0s0 c1t7d0s0 c4t0d0s0 c4t1d0s0 c4t2d0s0 c4t3d0s0 c4t4d0s0 c4t5d0s0 c4t6d0s0 c4t7d0s0 c5t1d0s0 c5t2d0s0 c5t3d0s0 c5t5d0s0 c5t6d0s0 c5t7d0s0 c6t0d0s0 c6t1d0s0 c6t2d0s0 c6t3d0s0 c6t4d0s0 c6t5d0s0 c6t6d0s0 c6t7d0s0 c7t0d0s0 c7t1d0s0 c7t2d0s0 c7t3d0s0 c7t4d0s0 c7t5d0s0 c7t6d0s0 c7t7d0s0 -i 128k d100: Concat/Stripe is setup [46x 500GB] And I get not so good results - maximum 1GB of reading... h... maxphys is 56K - I thought
Re: [zfs-discuss] Direct I/O ability with zfs?
Jim Mauro writes:

  Where does the win come from with directI/O? Is it 1), 2), or some combination? If it's a combination, what's the percentage of each towards the win?

 That will vary based on workload (I know, you already knew that ... :^). Decomposing the performance win between what is gained as a result of single-writer lock breakup and what is gained from no caching is something we can only guess at, because, at least for UFS, you can't do just one - it's all or nothing.

  We need to tease 1) and 2) apart to have a full understanding.

 We can't. We can only guess (for UFS).

 My opinion - it's a must-have for ZFS if we're going to get serious attention in the database space. I'll bet dollars-to-donuts that, over the next several years, we'll burn many tens of millions of dollars on customer support escalations that come down to memory utilization issues and contention between database-specific buffering and the ARC. This is entirely my opinion (not that of Sun),

...memory utilisation... OK, so we should implement the 'lost cause' RFE. In all cases, ZFS must not steal pages from other memory consumers:

        6488341 ZFS should avoiding growing the ARC into trouble

So the DB memory pages should not be _contended_ for.

-r

 and I've been wrong before.

 Thanks,
 /jim
Re: [zfs-discuss] Direct I/O ability with zfs?
eric kustarz writes:

  Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm surprised that this is being met with skepticism considering that Oracle highly recommends direct IO be used, and, IIRC, Oracle performance was the main motivation for adding DIO to UFS back in Solaris 2.6. This isn't a problem with ZFS or any specific fs per se, it's the buffer caching they all employ. So I'm a big fan of seeing 6429855 come to fruition.

  The point is that directI/O typically means two things:
   1) concurrent I/O
   2) no caching at the file system

In my blog I also mention:
   3) no readahead (but that can be viewed as an implicit consequence of 2)

And someone chimed in with:
   4) ability to do I/O at sector granularity

I also think that for many, 2) is too weak a form of what they expect:
   5) DMA straight from the user buffer to disk, avoiding a copy

So:

 1) Concurrent I/O: we have this in ZFS.

 2) No caching: we could do this by taking a directio hint and evicting the ARC buffer immediately after the copyout to user space for reads, and after txg completion for writes.

 3) No prefetching: we have 2 levels of prefetching. The low level was fixed recently and should not cause problems for DB loads. The high level still needs fixing on its own; there we should take the same hint as 2) to disable it altogether. In the meantime we can tune our way into this mode.

 4) Sector-sized I/O: this is really foreign to the ZFS design.

 5) Zero copy, i.e. more CPU efficiency: I think this is where the debate is.

My line has been that 5) won't help latency much, and latency is where I think the game is currently played. Now, the disconnect might be that people feel the game is not latency but CPU efficiency: how many CPU cycles do I burn to get data from disk to the user buffer? This is a valid point. Configurations with a very large number of disks can end up saturated by filesystem CPU utilisation.

So I still think that the major area for ZFS perf gains is on the latency front: block allocation (now much improved with the separate intent log), I/O scheduling, and other fixes to the threading and ARC behavior. But at some point we can turn our microscope on the CPU efficiency of the implementation. The copy will certainly be a big chunk of the CPU cost per I/O, but I would still like to gather that data.

Also consider: 50 disks at 200 IOPS of 8K is 80 MB/sec. That means maybe 1/10th of a single CPU to be saved by avoiding just the copy. Probably not what people have in mind. How many CPUs do you have when attaching 1000 drives to a host running a 100TB database? That many drives will barely occupy 2 cores running the copies.

People want performance and efficiency. Directio is just an overloaded name that delivered those gains to other filesystems. Right now, what I think is worth gathering is the cycles spent in ZFS per read and write in a large DB environment where the DB holds 90% of memory. For comparison with another FS, we should disable checksums, file prefetching and vdev prefetching, cap the ARC, turn atime off, and use an 8K recordsize. A breakdown and comparison of the CPU cost per layer will be quite interesting and will point to what needs work.

Another interesting thing for me would be: what is your budget? How many cycles per DB read and write are you willing to spend, and how did you come to that number?

But, as Eric says, let's develop 2) and I'll try in parallel to figure out the per-layer breakdown cost.

-r

  Most file systems (ufs, vxfs, etc.) don't do 1) or 2) without turning on directI/O. ZFS *does* 1. It doesn't do 2 (currently). That is what we're trying to discuss here.

  Where does the win come from with directI/O? Is it 1), 2), or some combination? If it's a combination, what's the percentage of each towards the win? We need to tease 1) and 2) apart to have a full understanding.

  I'm not against adding 2) to ZFS but want more information. I suppose I'll just prototype it and find out for myself.

  eric
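For reference, a hedged sketch of the comparison settings mentioned above (the dataset name is hypothetical; the /etc/system names are unstable tunables, so verify them against your build, and the vdev-prefetch knob varies by release):

  # zfs set checksum=off pool/db
  # zfs set atime=off pool/db
  # zfs set recordsize=8k pool/db

  * /etc/system: disable file-level prefetch and cap the ARC (4GB here, for example)
  set zfs:zfs_prefetch_disable = 1
  set zfs:zfs_arc_max = 0x100000000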
Re: [zfs-discuss] Direct I/O ability with zfs?
Nicolas Williams writes: On Thu, Oct 04, 2007 at 03:49:12PM +0200, Roch - PAE wrote: ...memory utilisation... OK so we should implement the 'lost cause' rfe. In all cases, ZFS must not steal pages from other memory consumers : 6488341 ZFS should avoiding growing the ARC into trouble So the DB memory pages should not be _contended_ for. What if your executable text, and pretty much everything, lives on ZFS? You don't want to contend for the memory caching those things either. It's not just the DB's memory you don't want to contend for. On the read side, we're talking here about 1000 disks each running 35 concurrent I/Os of 8K, so a footprint of roughly 250MB, to stage a ton of work. On the write side we do have to play with the transaction group, so that will be 5-10 seconds worth of synchronous write activity. But how much memory does a 1000-disk server have? -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Direct I/O ability with zfs?
Nicolas Williams writes: On Wed, Oct 03, 2007 at 04:31:01PM +0200, Roch - PAE wrote: It does, which leads to the core problem. Why do we have to store the exact same data twice in memory (i.e., once in the ARC, and once in the shared memory segment that Oracle uses)? We do not retain 2 copies of the same data. If the DB cache is made large enough to consume most of memory, the ZFS copy will quickly be evicted to stage other I/Os on their way to the DB cache. What problem does that pose ? Other things deserving of staying in the cache get pushed out by things that don't deserve being in the cache. Thus systemic memory pressure (e.g., more on-demand paging of text). Nico -- I agree. That's why I submitted both of these. 6429855 Need way to tell ZFS that caching is a lost cause 6488341 ZFS should avoiding growing the ARC into trouble -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Direct I/O ability with zfs?
Rayson Ho writes: 1) Modern DBMSs cache database pages in their own buffer pool because it is less expensive than to access data from the OS. (IIRC, MySQL's MyISAM is the only one that relies on the FS cache, but a lot of MySQL sites use InnoDB, which has its own buffer pool) The DB can and should cache data whether or not directio is used. 2) Also, direct I/O is faster because it avoids double buffering. A piece of data can be in one buffer, 2 buffers, 3 buffers. That says nothing about performance. More below. So I guess you mean DIO is faster because it avoids the extra copy: DMA straight to the user buffer rather than DMA to a kernel buffer then a copy to the user buffer. If an I/O is 5ms, an 8K copy is about 10 usec. Is avoiding the copy really the most urgent thing to work on? Rayson On 10/2/07, eric kustarz [EMAIL PROTECTED] wrote: Not yet, see: 6429855 Need way to tell ZFS that caching is a lost cause Is there a specific reason why you need to do the caching at the DB level instead of the file system? I'm really curious as I've got conflicting data on why people do this. If I get more data on real reasons on why we shouldn't cache at the file system, then this could get bumped up in my priority queue. I can't answer this, although I can well imagine that the DB is the most efficient place to cache its own data, all organised and formatted to respond to queries. But once the DB has signified to the FS that it doesn't require the FS to cache data, then the benefit from this RFE is that the memory used to stage the data can be quickly recycled by ZFS for subsequent operations. It means the ZFS memory footprint is more likely to contain useful ZFS metadata and not cached data blocks we know are not likely to be used again anytime soon. We would also operate better in mixed DIO/non-DIO workloads. See also: http://blogs.sun.com/roch/entry/zfs_and_directio -r eric ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Direct I/O ability with zfs?
Matty writes: On 10/3/07, Roch - PAE [EMAIL PROTECTED] wrote: Rayson Ho writes: 1) Modern DBMSs cache database pages in their own buffer pool because it is less expensive than to access data from the OS. (IIRC, MySQL's MyISAM is the only one that relies on the FS cache, but a lot of MySQL sites use INNODB which has its own buffer pool) The DB can and should cache data whether or not directio is used. It does, which leads to the core problem. Why do we have to store the exact same data twice in memory (i.e., once in the ARC, and once in the shared memory segment that Oracle uses)? We do not retain 2 copies of the same data. If the DB cache is made large enough to consume most of memory, the ZFS copy will quickly be evicted to stage other I/Os on their way to the DB cache. What problem does that pose ? -r Thanks, - Ryan -- UNIX Administrator http://prefetch.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS array NVRAM cache?
Vincent Fox writes: I don't understand. How do you set up one LUN that has all of the NVRAM on the array dedicated to it? I'm pretty familiar with 3510 and 3310. Forgive me for being a bit thick here, but can you be more specific for the n00b? Do you mean from the firmware side or the OS side? Or since the LUNs used for the ZIL are separated out from the other disks in the pool, they DO get to make use of the NVRAM, is that it? I have a pair of 3310 with 12 36-gig disks for testing. I have a V240 with a PCI dual-SCSI controller so I can drive one array from each port, which is what I am tinkering with right now. Looking for maximum reliability/redundancy of course, so I would ZFS mirror the arrays. Can you suggest a setup here? A single disk from each array exported as a LUN, then ZFS mirrored together for the ZIL log? An example would be helpful. Could I then just lump all the remaining disks into a 10-disk RAID-5 LUN, mirror them together and achieve a significant performance improvement? Still have to have a global spare of course in the HW RAID. What about sparing for the ZIL? With PSARC 2007/171 ZFS Separate Intent Log now in Nevada, you can set up the ZIL on its own set of (possibly very fast) luns. The luns can be mirrored if you have more than one NVRAM card. http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on This will work great to accelerate JBOD using just a small amount of NVRAM for the ZIL. When the storage is fronted 100% by NVRAM, the benefits of the slog won't be as large. Last week we had this putback : PSARC 2007/053 Per-Disk-Device support of non-volatile cache 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to SBC-2 devices which will prevent some recognised arrays from doing unnecessary cache flushes and allow tuning others using sd.conf. Otherwise arrays will be queried for support of the SYNC_NV capability. IMO, the best is to bug storage vendors into supporting SYNC_NV. For earlier releases, to get the full benefit of the NVRAM on zil operations you are stuck with a raw tuning proposition : http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH -r See also : http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
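For reference, an untested sketch of what attaching a mirrored slog to an existing pool looks like on bits with PSARC 2007/171. The pool and device names are made up; each device here would be a small, fast LUN taken from one of the arrays:

  # add a mirrored separate intent log to an existing pool
  zpool add tank log mirror c2t0d0 c3t0d0
  # the log vdev then shows up in its own section of the status output
  zpool status tank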
Re: [zfs-discuss] io:::start and zfs filenames?
Neelakanth Nadgir writes: The io:::start probe does not seem to get zfs filenames in args[2]->fi_pathname. Any ideas how to get this info? -neel Who says an I/O is doing work for a single pathname/vnode or for a single process? There is no longer that one-to-one correspondence. Not in the ZIL, and not in the transaction groups, due to I/O aggregation. As for mmaped I/O, follow Jim's advice; I guess fsflush will be issuing some putpage : fsinfo:genunix:fop_putpage:putpage -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
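A quick, untested sketch of watching those putpage calls with the fsinfo provider mentioned above, on builds that have it (run as root; the predicate on fi_fs just restricts the output to ZFS):

  # count fop_putpage calls per pathname on ZFS filesystems
  dtrace -n 'fsinfo:::putpage /args[0]->fi_fs == "zfs"/ { @[args[0]->fi_pathname] = count(); }'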
Re: [zfs-discuss] ZFS ARC DNLC Limitation
Hi Jason, This should have helped: 6542676 ARC needs to track meta-data memory overhead Some of the lines from arc.c: 1551 if (arc_meta_used >= arc_meta_limit) { 1552 /* 1553 * We are exceeding our meta-data cache limit. 1554 * Purge some DNLC entries to release holds on meta-data. 1555 */ 1556 dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent); 1557 } -r Jason J. W. Williams writes: Hello All, A while back (Feb '07) when we noticed ZFS was hogging all the memory on the system, y'all were kind enough to help us use the arc_max tunable to attempt to limit that usage to a hard value. Unfortunately, at the time a sticky problem was that the hard limit did not include DNLC entries generated by ZFS. I've been watching the list since then and trying to watch the Nevada commits. I haven't noticed that anything has been committed back so that arc_max truly enforces the max amount of memory ZFS is allowed to consume (including DNLC entries). Has this been corrected and I just missed it? Thank you in advance for any help. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
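One rough way to see how close a running system is to that limit (an untested sketch; it assumes the arc_meta_used/arc_meta_limit symbols from the snippet above are visible to mdb on your build, and that the genunix ::dnlc dcmd is available):

  # ARC metadata accounting, in bytes, plus a count of DNLC entries
  echo "arc_meta_used/E" | mdb -k
  echo "arc_meta_limit/E" | mdb -k
  echo "::dnlc ! wc -l" | mdb -k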
Re: [zfs-discuss] ZFS (and quota)
Pawel Jakub Dawidek writes: I'm CCing zfs-discuss@opensolaris.org, as this doesn't look like a FreeBSD-specific problem. It looks like there is a problem with block allocation(?) when we are near the quota limit. The tank/foo dataset has quota set to 10m: Without quota: FreeBSD: # dd if=/dev/zero of=/tank/test bs=512 count=20480 time: 0.7s Solaris: # dd if=/dev/zero of=/tank/test bs=512 count=20480 time: 4.5s With quota: FreeBSD: # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480 dd: /tank/foo/test: Disc quota exceeded time: 306.5s Solaris: # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480 write: Disc quota exceeded time: 602.7s The CPU is almost entirely idle, but disk activity seems to be high. Yes, as we near the quota limit, each transaction group will accept only a small amount of data so as not to overshoot the limit. I don't know if we have the optimal strategy yet. -r Any ideas? -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs and small files
Claus Guttesen writes: I have many small - mostly jpg - files where the original file is approx. 1 MB and the thumbnail generated is approx. 4 KB. The files are currently on vxfs. I have copied all files from one partition onto a zfs ditto. The vxfs partition occupies 401 GB and zfs 449 GB. Most files uploaded are jpg and all thumbnails are always jpg. Is there a problem? Not the disk usage by itself. But if zfs takes up more space than vxfs (12 %), 80 TB will become 89 TB instead (our current storage) and add cost. Also, how are you measuring this (what commands)? I did a 'df -h'. Will a different volblocksize (during creation of the partition) make better use of the available diskspace? Will (meta)data require less space if compression is enabled? volblocksize won't have any effect on file systems, it is for zvols. Perhaps you mean recordsize? But recall that recordsize is the maximum limit, not the actual limit, which is decided dynamically. I read http://www.opensolaris.org/jive/thread.jspa?threadID=37673tstart=105 which is very similar to my case except for the file type. But no clear pointers otherwise. A good start would be to find the distribution of file sizes. The files are approx. 1 MB with a thumbnail of approx. 4 KB. So the 1 MB files are stored as ~8 x 128K records. Because of 5003563 use smaller tail block for last block of object the last block of your file is partially used. It will depend on your filesize distribution, but without that info we can only guess that we're wasting an avg of 64K per file, or 6%. If your distribution is such that most files are slightly more than 1M, then we'd have 12% overhead from this effect. So using a 16K/32K recordsize would quite possibly help, as files would be stored using ~64 x 16K blocks with an overhead of 1-2% (0.5 blocks wasted every 64). -r -- regards Claus When lenity and cruelty play for a kingdom, the gentlest gamester is the soonest winner. Shakespeare ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
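A quick-and-dirty way to get that file-size distribution before picking a recordsize (an untested sketch; the path is just an example, and the pipeline breaks on filenames containing spaces):

  # histogram of file sizes in 16K buckets under /tank/images
  find /tank/images -type f | xargs ls -l | awk '{ print int($5/16384) }' | sort -n | uniq -c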
Re: [zfs-discuss] zfs and small files
Claus Guttesen writes: So the 1 MB files are stored as ~8 x 128K records. Because of 5003563 use smaller tail block for last block of object the last block of your file is partially used. It will depend on your filesize distribution, but without that info we can only guess that we're wasting an avg of 64K per file, or 6%. If your distribution is such that most files are slightly more than 1M, then we'd have 12% overhead from this effect. So using a 16K/32K recordsize would quite possibly help, as files would be stored using ~64 x 16K blocks with an overhead of 1-2% (0.5 blocks wasted every 64). I will (re)create the partition and modify the recordsize. I was unwilling to do so when I read the man page, which discourages modifying this setting unless a database is used. Does zfs use suballocation if a file does not use an entire recordsize? If not, the thumbnails probably waste the most space. They are approx. 4 KB. Files smaller than 'recordsize' are stored using a multiple of the sector size. So small files should not factor into this equation. I'll be testing recordsizes from 1K and upwards. Actually 1K made zfs very slow but 2K seems fine. I'll report back when the entire partition has been copied. When I find the sweet spot I'll try to enable (default) compression. Beware, because at 2K you might be generating more indirect blocks. For 1MB files the gains from using a recordsize smaller than 16K start to be quite small. -r Thank you. -- regards Claus When lenity and cruelty play for a kingdom, the gentlest gamester is the soonest winner. Shakespeare ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
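If it helps, recreating the filesystem with a different recordsize might look something like the following (an untested sketch with made-up names; note that a new recordsize only applies to files written after it is set, so the data has to be recopied):

  # new filesystem with a 32K recordsize, then recopy the images into it
  zfs create tank/images2
  zfs set recordsize=32k tank/images2
  zfs get recordsize,used,compressratio tank/images2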
Re: [zfs-discuss] question about uberblock blkptr
[EMAIL PROTECTED] writes: Roch - PAE wrote: [EMAIL PROTECTED] writes: Jim Mauro wrote: Hey Max - Check out the on-disk specification document at http://opensolaris.org/os/community/zfs/docs/. The page 32 illustration shows the rootbp pointing to a dnode_phys_t object (the first member of an objset_phys_t data structure). The source code indicates ub_rootbp is a blkptr_t, which contains a 3-member array of dva_t's called blk_dva (blk_dva[3]). Each dva_t is a 2-member array of 64-bit unsigned ints (dva_word[2]). So it looks like each blk_dva contains 3 128-bit DVAs. You probably figured all this out already. Did you try using an objset_phys_t to format the data? Thanks, /jim Ok. I think I know what's wrong. I think the information (most likely, an objset_phys_t) is compressed with lzjb compression. Is there a way to turn this entirely off (not just for file data, but for all metadata as well) when a pool is created? Or do I need to figure out how to hack the lzjb_decompress() function into my modified mdb? (Also, I figured out that zdb is already doing the left shift by 9 before dumping DVA values, for anyone following this...). Max, this might help (zfs_mdcomp_disable) : http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#METACOMP Hi Roch, That would help, except it does not seem to work. I set zfs_mdcomp_disable to 1 with mdb, deleted the pool, recreated the pool, and zdb still shows the rootbp in the uberblock_t to have the lzjb flag turned on. So I then added the variable to /etc/system, destroyed the pool, rebooted, recreated the pool, and still the same result. Also, my mdb shows the same thing for the uberblock_t rootbp blkptr data. I am running Nevada build 55b. I shall update the build I am running soon, but in the meantime I'll probably write a modified cmd_print() function for my (modified) mdb to handle (at least) lzjb-compressed metadata. Also, I think the ZFS Evil Tuning Guide should be modified. It says this can be tuned for Solaris 10 11/06 and snv_52. I guess that means only those two releases. snv_55b has the variable, but it doesn't have an effect (at least on the uberblock_t rootbp metadata). Thanks for your help. max My bad. The tunable only affects indirect dbufs (so I guess only for large files). As you noted, other metadata is compressed unconditionally (I guess from the use of ZIO_COMPRESS_LZJB in dmu_objset_open_impl). -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
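For anyone following along, the usual ways to set that tunable, per the Evil Tuning Guide, are sketched below. As Roch notes above it only affects indirect blocks, so the rootbp stays lzjb-compressed either way:

  # /etc/system entry (takes effect after reboot, for pools created afterwards)
  set zfs:zfs_mdcomp_disable=1
  # or on a live kernel, before creating the test pool
  echo "zfs_mdcomp_disable/W 1" | mdb -kw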
[zfs-discuss] ZFS Evil Tuning Guide
Tuning should not be done in general and Best practices should be followed. So get very much acquainted with this first : http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide Then if you must, this could soothe or sting : http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide So drive carefully. -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Evil Tuning Guide
Pawel Jakub Dawidek writes: On Mon, Sep 17, 2007 at 03:40:05PM +0200, Roch - PAE wrote: Tuning should not be done in general and best practices should be followed. So get very much acquainted with this first : http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide Then if you must, this could soothe or sting : http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide So drive carefully. If some LUNs exposed to ZFS are not protected by NVRAM, then this tuning can lead to data loss or application-level corruption. However the ZFS pool integrity itself is NOT compromised by this tuning. Are you sure? Once you turn off cache flushing, how can you tell that your disk didn't reorder writes and the uberblock was updated before the new blocks were written? Will ZFS go to the previous blocks when the newest uberblock points at corrupted data? Good point. I'll fix this. I don't know if we look for an alternate uberblock, but even if we did, I guess the 'out of sync' can occur lower down the tree. -r -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Supporting recordsizes larger than 128K?
Matty writes: Are there any plans to support record sizes larger than 128k? We use ZFS file systems for disk staging on our backup servers (compression is a nice feature here), and we typically configure the disk staging process to read and write large blocks (typically 1MB or so). This reduces the number of I/Os that take place to our storage arrays, and our testing has shown that we can push considerably more I/O with 1MB+ block sizes. So other FSes and raw devices clearly benefit from a larger blocksize, but given the way ZFS schedules such I/Os, I don't expect any more throughput from bigger blocks. Maybe you're hitting something else that limits throughput? -r Thanks for any insight, - Ryan -- UNIX Administrator http://prefetch.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is ZFS efficient for large collections of small files?
Brandorr wrote: Is ZFS efficient at handling huge populations of tiny-to-small files - for example, 20 million TIFF images in a collection, each between 5 and 500k in size? I am asking because I could have sworn that I read somewhere that it isn't, but I can't find the reference. If you're worried about the I/O throughput, you should avoid RAIDZ1/2 configurations. Random read performance will be disastrous if you do; a raid-z group can do one random read per I/O latency. So 8 disks (each capable of 200 IOPS) in a zpool split into 2 raid-z groups should be able to serve 400 files per second. If you need to serve more files, then you need more disks or need to use mirroring. With mirroring, I'd expect to serve 1600 files (8*200). This model only applies to random reading, not to sequential access, nor to any type of write load. For small file creation ZFS can be extremely efficient, in that it can create more than 1 file per I/O. It should also approach disk streaming performance for write loads. I've seen random read rates of less than 1 MB/s on an X4500 with 40 dedicated disks for data storage. It would be nice to see if the above model matches your data. So if you have all 40 disks in a single raid-z group (an anti best practice) I'd expect 200 files served per second, and if the files were of 5K avg size then I'd expect that 1MB/sec. http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide, If you don't have to worry about disk space, use mirrors; right on! I got my best results during my extensive X4500 benchmarking sessions when I mirrored single slices instead of complete disks (resulting in 40 2-way mirrors on 40 physical discs, mirroring c0t0d0s0-c0t1d0s1 and c0t1d0s0-c0t0d0s1, and so on). If you're worried about disk space, you should consider striping several instances of RAIDZ1 arrays, each one consisting of three discs or slices. Sequential access will go down the cliff, but random reads will be boosted. Writes should be good if not great, no matter what the workload is. I'm interested in data that shows otherwise. You should also adjust the recordsize. For small files I certainly would not. Small files are stored as a single record when they are smaller than the recordsize. A single record is good in my book. Not sure when one would want otherwise for small files. Try to measure the average I/O transaction size. There's a good chance that your I/O performance will be best if you set your recordsize to a smaller value. For instance, if your average file size is 12 KB, try using an 8K or even 4K recordsize; stay away from 16K or higher. Tuning the record size is currently only recommended for databases (large files) with fixed record access. Again, it's interesting input if tuning the recordsize helped another type of workload. -r -- Ralf Ramge Senior Solaris Administrator, SCNA, SCSA Tel. +49-721-91374-3963 [EMAIL PROTECTED] - http://web.de/ 1&1 Internet AG Brauerstraße 48 76135 Karlsruhe Amtsgericht Montabaur HRB 6484 Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss Aufsichtsratsvorsitzender: Michael Scheeren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
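To make the two layouts in that model concrete, here is an untested sketch with 8 made-up disks; the two zpool create lines are alternatives, not meant to be run together:

  # alternative A: two 4-disk raid-z groups, ~2 x 200 = ~400 random reads/sec by the model above
  zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 raidz c0t4d0 c0t5d0 c0t6d0 c0t7d0
  # alternative B: four 2-way mirrors, ~8 x 200 = ~1600 random reads/sec, at half the usable space
  zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0 mirror c0t6d0 c0t7d0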
Re: [zfs-discuss] Odp: Is ZFS efficient for large collections of small files?
Łukasz K writes: Is ZFS efficient at handling huge populations of tiny-to-small files - for example, 20 million TIFF images in a collection, each between 5 and 500k in size? I am asking because I could have sworn that I read somewhere that it isn't, but I can't find the reference. It depends on what type of I/O you will do. If only reads, there is no problem. Writing small files (and removing them) will fragment the pool and it will be a huge problem. You can set recordsize to 32k (or 16k) and it will help for some time. Comparing a recordsize of 16K with 128K: Files in the range [0,16K] : no difference. Files in the range [16K,128K] : more efficient to use 128K. Files in the range [128K,500K] : more efficient to use 16K. In the [16K,128K] range the actual filesize is rounded up to the next 16K boundary with a 16K recordsize, and to the nearest 512B boundary with a 128K recordsize. This will be fairly catastrophic for files slightly above 16K (rounded up to 32K vs 16K+512B). In the [128K,500K] range we're hurt by 5003563 use smaller tail block for last block of object; until it is fixed, then yes, files stored using 16K records are rounded up more tightly. Metadata probably eats part of the gains. -r Lukas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
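A small, untested experiment to see the rounding effect described above (pool and dataset names are made up; the sizes reported by du also include metadata and will vary with the pool layout and compression):

  # two scratch datasets with different recordsizes
  zfs create tank/rs16k
  zfs set recordsize=16k tank/rs16k
  zfs create tank/rs128k
  zfs set recordsize=128k tank/rs128k
  # a 20K file lands just above one 16K record
  mkfile 20k /tank/rs16k/f
  mkfile 20k /tank/rs128k/f
  sync; sleep 5     # let the txg commit before looking at allocations
  du -k /tank/rs16k/f /tank/rs128k/f
  # expect roughly 32K allocated with 16K records vs ~21K with a 128K recordsize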
[zfs-discuss] There is no NFS over ZFS issue
Regarding the bold statement There is no NFS over ZFS issue What I mean here is that, if you _do_ encounter a performance pathology not linked to the NVRAM storage/cache flush issue, then you _should_ complain, or better, get someone to do an analysis of the situation. One should not assume that some observed pathological performance of NFS/ZFS is widespread and due to some known ZFS issue about to be fixed. To be sure, there are lots of performance opportunities that will provide incremental improvements, the most significant of which is the ZFS Separate Intent Log just integrated in Nevada. This opens up the field for further NFS/ZFS performance investigations. But the data that got this thread started seems to highlight an NFS vs Samba opportunity, something we need to look into. Otherwise I don't think that the data produced so far has highlighted any specific NFS/ZFS issue. There are certainly opportunities for incremental performance improvements but, to the best of my knowledge, outside the NVRAM/flush issue on certain storage : There are no known prevalent NFS over ZFS performance pathologies on record. -r Ref: http://mail.opensolaris.org/pipermail/zfs-discuss/2007-June/thread.html#29026 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Slow write speed to ZFS pool (via NFS)
Sorry about that; looks like you've hit this: 6546683 marvell88sx driver misses wakeup for mv_empty_cv http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6546683 Fixed in snv_64. -r Thomas Garner writes: We have seen this behavior, but it appears to be entirely related to the hardware having the Intel IPMI stuff swallow up the NFS traffic on port 623 directly by the network hardware and never getting. http://blogs.sun.com/shepler/entry/port_623_or_the_mount Unfortunately, this nfs hangs across 3 separate machines, none of which should have this IPMI issue. It did spur me on to dig a little deeper, though, so thanks for the encouragement that all may not be well. Can anyone debug this? Remember that this is Nexenta Alpha 7, so it should be b61. nfsd is totally hung (rpc timeouts) and zfs would be having problems taking snapshots, if I hadn't disabled the hourly snapshots. Thanks! Thomas [EMAIL PROTECTED] ~]$ rpcinfo -t filer0 nfs rpcinfo: RPC: Timed out program 13 version 0 is not available echo ::pgrep nfsd | ::walk thread | ::findstack -v | mdb -k stack pointer for thread 821cda00: 822d6e28 822d6e5c swtch+0x17d() 822d6e8c cv_wait_sig_swap_core+0x13f(8b8a9232, 8b8a9200, 0) 822d6ea4 cv_wait_sig_swap+0x13(8b8a9232, 8b8a9200) 822d6ee0 cv_waituntil_sig+0x100(8b8a9232, 8b8a9200, 0) 822d6f44 poll_common+0x3e1(8069480, a, 0, 0) 822d6f84 pollsys+0x7c() 822d6fac sys_sysenter+0x102() stack pointer for thread 821d2e00: 8c279d98 8c279dcc swtch+0x17d() 8c279df4 cv_wait_sig+0x123(8988796e, 89887970) 8c279e2c svc_wait+0xaa(1) 8c279f84 nfssys+0x423() 8c279fac sys_sysenter+0x102() stack pointer for thread a9f88800: 8c92e218 8c92e244 swtch+0x17d() 8c92e254 cv_wait+0x4e(8a4169ea, 8a4169e0) 8c92e278 mv_wait_for_dma+0x32() 8c92e2a4 mv_start+0x278(88252c78, 89833498) 8c92e2d4 sata_hba_start+0x79(8987d23c, 8c92e304) 8c92e308 sata_txlt_synchronize_cache+0xb7(8987d23c) 8c92e334 sata_scsi_start+0x1b7(8987d1e4, 8987d1e0) 8c92e368 scsi_transport+0x52(8987d1e0) 8c92e3a4 sd_start_cmds+0x28a(8a2710c0, 0) 8c92e3c0 sd_core_iostart+0x158(18, 8a2710c0, 8da3be70) 8c92e3f8 sd_uscsi_strategy+0xe8(8da3be70) 8c92e414 sd_send_scsi_SYNCHRONIZE_CACHE+0xd4(8a2710c0, 8c50074c) 8c92e4b0 sdioctl+0x48e(1ac0080, 422, 8c50074c, 8010, 883cee68, 0) 8c92e4dc cdev_ioctl+0x2e(1ac0080, 422, 8c50074c, 8010, 883cee68, 0) 8c92e504 ldi_ioctl+0xa4(8a671700, 422, 8c50074c, 8010, 883cee68, 0) 8c92e544 vdev_disk_io_start+0x187(8c500580) 8c92e554 vdev_io_start+0x18(8c500580) 8c92e580 zio_vdev_io_start+0x142(8c500580) 8c92e59c zio_next_stage+0xaa(8c500580) 8c92e5b0 zio_ready+0x136(8c500580) 8c92e5cc zio_next_stage+0xaa(8c500580) 8c92e5ec zio_wait_for_children+0x46(8c500580, 1, 8c50076c) 8c92e600 zio_wait_children_ready+0x18(8c500580) 8c92e614 zio_next_stage_async+0xac(8c500580) 8c92e624 zio_nowait+0xe(8c500580) 8c92e660 zio_ioctl+0x94(9c6f8300, 89557c80, 89556400, 422, 0, 0) 8c92e694 zil_flush_vdev+0x54(89557c80, 0, 0, 8c92e6e0, 9c6f8500) 8c92e6e4 zil_flush_vdevs+0x6b(8bbe46c0) 8c92e734 zil_commit_writer+0x35f(8bbe46c0, 3497c, 0, 4af5, 0) 8c92e774 zil_commit+0x96(8bbe46c0, , , 4af5, 0) 8c92e7e8 zfs_putpage+0x1e4(8c8ab480, 0, 0, 0, 0, 8c6c75c0) 8c92e824 vhead_putpage+0x95(8c8ab480, 0, 0, 0, 0, 8c6c75c0) 8c92e86c fop_putpage+0x27(8c8ab480, 0, 0, 0, 0, 8c6c75c0) 8c92e91c rfs4_op_commit+0x153(82141dd4, b28c3100, 8c92ed8c, 8c92e948) 8c92ea48 rfs4_compound+0x1ce(8c92ead0, 8c92ea7c, 0, 8c92ed8c, 0) 8c92eaac rfs4_dispatch+0x65(8bf9b248, 8c92ed8c, b28c5a40, 8c92ead0) 8c92ed10 common_dispatch+0x6b0(8c92ed8c, b28c5a40, 2, 4, 8bf9c01c, 8bf9b1f0) 8c92ed34 
rfs_dispatch+0x1f(8c92ed8c, b28c5a40) 8c92edc4 svc_getreq+0x158(b28c5a40, 842952a0) 8c92ee0c svc_run+0x146(898878e8) 8c92ee2c svc_do_run+0x6e(1) 8c92ef84 nfssys+0x3fb() 8c92efac sys_sysenter+0x102() snipping out a bunch of other threads ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Slow write speed to ZFS pool (via NFS)
Joe S writes: After researching this further, I found that there are some known performance issues with NFS + ZFS. I tried transferring files via SMB, and got write speeds on average of 25MB/s. So I will have my UNIX systems use SMB to write files to my Solaris server. This seems weird, but it's fast. I'm sure Sun is working on fixing this. I can't imagine running a Sun box without NFS. Call me picky, but : There is no NFS over ZFS issue (IMO/FWIW). There is a ZFS over NVRAM issue; well understood (not related to NFS). There is a Samba vs NFS issue; not well understood (not related to ZFS). This last bullet is probably better suited for [EMAIL PROTECTED] If ZFS is talking to a storage array with NVRAM, then we have an issue (not related to NFS) described by : http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6462690 6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to SBC-2 devices The above bug/rfe lies in the SD driver but is very much triggered by ZFS, particularly when running NFS, but not only. It affects only NVRAM-based storage and is being worked on. If ZFS is talking to a JBOD, then the slowness is a characteristic of NFS (not related to ZFS). So FWIW on JBOD, there is no ZFS+NFS issue in the sense that I don't know how we could change ZFS to be significantly better at NFS, nor do I know of a change to NFS that would help _particularly_ ZFS. Doesn't mean there is none, I just don't know about them. So please ping me if you highlight such an issue. So if one replaces ZFS by some other filesystem and gets a large speedup, I'm interested (make sure the other filesystem either runs with the write cache off, or flushes it on NFS commit). So that leaves us with a Samba vs NFS issue (not related to ZFS). We know that NFS is able to create files _at most_ at one file per server I/O latency. Samba appears better and this is what we need to investigate. It might be better in a way that NFS can borrow (maybe through some better NFSv4 delegation code) or Samba might be better by being careless with data. If we find such an NFS improvement it will help all backend filesystems, not just ZFS. Which is why I say: There is no NFS over ZFS issue. -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS and Tar/Star Performance
Hi Siegfried, just making sure you had seen this: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine You have very fast NFS to non-ZFS runs. That seems only possible if the hosting OS did not sync the data when NFS required it, or the drive in question had some fast write cache. If the drive did have some FWC and ZFS was still slow using it, that would be the issue with flushing mentioned in the blog entry. But also maybe there is something to be learned from the Samba and AFP results... Takeaways: ZFS and NFS just work together. ZFS has an open issue with some storage arrays (the issue is *not* related to NFS); it's being worked on and will need collaboration from storage vendors. NFS is slower than direct attached. It can be very, very much slower on single-threaded loads. There are many ways to work around the slowness, but most are just not safe for your data. -r Siegfried Nikolaivich writes: This is an old topic, discussed many times at length. However, I still wonder if there are any workarounds to this issue except disabling the ZIL, since it makes ZFS over NFS almost unusable (it's a whole magnitude slower). My understanding is that the ball is in the hands of NFS due to ZFS's design. The testing results are below. Solaris 10u3 AMD64 server with Mac client over gigabit ethernet. The filesystem is on a 6-disk raidz1 pool, testing the performance of untarring (with bzip2) the Linux 2.6.21 source code. The archive is stored locally and extracted remotely. Locally --- tar xfvj linux-2.6.21.tar.bz2 real 4m4.094s, user 0m44.732s, sys 0m26.047s star xfv linux-2.6.21.tar.bz2 real 1m47.502s, user 0m38.573s, sys 0m22.671s Over NFS --- tar xfvj linux-2.6.21.tar.bz2 real 48m22.685s, user 0m45.703s, sys 0m59.264s star xfv linux-2.6.21.tar.bz2 real 49m13.574s, user 0m38.996s, sys 0m35.215s star -no-fsync -x -v -f linux-2.6.21.tar.bz2 real 49m32.127s, user 0m38.454s, sys 0m36.197s The performance seems pretty bad, let's see how other protocols fare. Over Samba --- tar xfvj linux-2.6.21.tar.bz2 real 4m34.952s, user 0m44.325s, sys 0m27.404s star xfv linux-2.6.21.tar.bz2 real 4m2.998s, user 0m44.121s, sys 0m29.214s star -no-fsync -x -v -f linux-2.6.21.tar.bz2 real 4m13.352s, user 0m44.239s, sys 0m29.547s Over AFP --- tar xfvj linux-2.6.21.tar.bz2 real 3m58.405s, user 0m43.132s, sys 0m40.847s star xfv linux-2.6.21.tar.bz2 real 19m44.212s, user 0m38.535s, sys 0m38.866s star -no-fsync -x -v -f linux-2.6.21.tar.bz2 real 3m21.976s, user 0m42.529s, sys 0m39.529s Samba and AFP are much faster, except the fsync'ed star over AFP. Is this a ZFS or NFS issue? Over NFS to non-ZFS drive --- tar xfvj linux-2.6.21.tar.bz2 real 5m0.211s, user 0m45.330s, sys 0m50.118s star xfv linux-2.6.21.tar.bz2 real 3m26.053s, user 0m43.069s, sys 0m33.726s star -no-fsync -x -v -f linux-2.6.21.tar.bz2 real 3m55.522s, user 0m42.749s, sys 0m35.294s It looks like ZFS is the culprit here. The untarring is much faster to a single 80 GB UFS drive than to a 6-disk raid-z array over NFS. Cheers, Siegfried PS. Getting netatalk to compile on amd64 Solaris required some changes since i386 wasn't being defined anymore, and somehow it thought the architecture was sparc64 for some linking steps. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS - Use h/w raid or not? Thoughts. Considerations.
Torrey McMahon writes: Toby Thain wrote: On 25-May-07, at 1:22 AM, Torrey McMahon wrote: Toby Thain wrote: On 22-May-07, at 11:01 AM, Louwtjie Burger wrote: On 5/22/07, Pål Baltzersen [EMAIL PROTECTED] wrote: What if your HW-RAID controller dies? In say 2 years or more... What will read your disks as a configured RAID? Do you know how to (re)configure the controller or restore the config without destroying your data? Do you know for sure that a spare part and firmware will be identical, or at least compatible? How good is your service subscription? Maybe only scrapyards and museums will have what you had. =o Be careful when talking about RAID controllers in general. They are not created equal! ... Hardware raid controllers have done the job for many years ... Not quite the same job as ZFS, which offers integrity guarantees that RAID subsystems cannot. Depends on the guarantees. Some RAID systems have built-in block checksumming. Which still isn't the same. Sigh. Yep. You get what you pay for. Funny how ZFS is free to purchase, isn't it? With RAID-level block checksumming, if the data gets corrupted on its way _to_ the array, that data is lost. With ZFS and RAID-Z or mirroring, you will recover the data. -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
Ian Collins writes: Roch Bourbonnais wrote: With recent bits, ZFS compression is now handled concurrently, with many CPUs working on different records. So this load will burn more CPUs and achieve its results (compression) faster. Would changing (selecting a smaller) filesystem record size have any effect? If the problem is that we just have a high kernel load compressing blocks, then probably not. If anything, small records might be a tad less efficient (thus needing more CPU). So the observed pauses should be consistent with those of a load generating high system time. The assumption is that compression now goes faster than when it was single threaded. Is this undesirable? We might seek a way to slow down compression in order to limit the system load. I think you should, otherwise we have a performance throttle that scales with the number of cores! Again I wonder to what extent the issue becomes painful due to the lack of write throttling. Once we have that in, we should revisit this. -r Ian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
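For context, the per-dataset gzip compression being discussed is enabled roughly like this on builds that have it (the dataset name is an example; gzip-1 through gzip-9 select the compression level, plain gzip being the default level):

  # enable gzip compression on a dataset and check the resulting ratio
  zfs set compression=gzip tank/backup
  zfs get compression,compressratio tank/backup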
Re: [zfs-discuss] ARC, mmap, pagecache...
Manoj Joseph writes: Hi, I was wondering about the ARC and its interaction with the VM page cache... When a file on a ZFS filesystem is mmaped, does the ARC cache get mapped to the process' virtual memory? Or is there another copy? My understanding is: the ARC does not get mapped to user space. The data ends up in the ARC (in recordsize chunks) and in the page cache (in page chunks). Both copies are updated on writes. -r -Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync
Robert Milkowski writes: Hello Wee, Thursday, April 26, 2007, 4:21:00 PM, you wrote: WYT On 4/26/07, cedric briner [EMAIL PROTECTED] wrote: okay let's say that it is not. :) Imagine that I set up a box: - with Solaris - with many HDs (directly attached). - use ZFS as the FS - export the data with NFS - on a UPS. Then after reading the : http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations I wonder if there is a way to tell the OS to ignore the fsync flush commands since they are likely to survive a power outage. WYT Cedric, WYT You do not want to ignore syncs from ZFS if your harddisk is directly WYT attached to the server. As the document mentioned, that is really for WYT complex storage with NVRAM where the flush is not necessary. What?? Setting zil_disable=1 has nothing to do with NVRAM in storage arrays. It disables the ZIL in ZFS, which means that if an application calls fsync() or opens a file with O_DSYNC, etc. then ZFS won't honor it (it returns immediately without committing to stable storage). Once the txg closes, data will be written to disk and SCSI write cache flush commands will be sent. Setting zil_disable to 1 is not that bad actually, and if someone doesn't mind losing the last N seconds of data in case of a server crash (zfs itself will still be consistent) it can actually speed up nfs operations a lot. ...set zil_disable...speed up nfs...at the expense of a risk of corruption of the NFS client's view. We must never forget this. zil_disable is really not an option IMO. -r -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync
Wee Yeh Tan writes: Robert, On 4/27/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Wee, Thursday, April 26, 2007, 4:21:00 PM, you wrote: WYT On 4/26/07, cedric briner [EMAIL PROTECTED] wrote: okay let's say that it is not. :) Imagine that I set up a box: - with Solaris - with many HDs (directly attached). - use ZFS as the FS - export the data with NFS - on a UPS. Then after reading the : http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations I wonder if there is a way to tell the OS to ignore the fsync flush commands since they are likely to survive a power outage. WYT Cedric, WYT You do not want to ignore syncs from ZFS if your harddisk is directly WYT attached to the server. As the document mentioned, that is really for WYT complex storage with NVRAM where the flush is not necessary. What?? Setting zil_disable=1 has nothing to do with NVRAM in storage arrays. It disables the ZIL in ZFS, which means that if an application calls fsync() or opens a file with O_DSYNC, etc. then ZFS won't honor it (it returns immediately without committing to stable storage). Wait a minute. Are we talking about zil_disable or zfs_noflush (or zfs_nocacheflush)? The article quoted was about configuring the array to ignore flush commands, or the device-specific zfs_noflush, not zil_disable. I agree that zil_disable is okay from the FS view (correctness still depends on the application), but zfs_noflush is dangerous. For me, both are dangerous. zil_disable can cause immense pain to applications and NFS clients. I don't see how anyone can recommend it without mentioning the risk of application/NFS corruption. zfs_nocacheflush is also unsafe. It opens a risk of pool corruption! But, if you have *all* of your pooled data on safe NVRAM-protected storage, and you don't find a way to tell the storage to ignore cache flush requests, you might want to set the variable temporarily until the SYNC_NV thing is sorted out. Then make sure nobody imports the tunable elsewhere without full understanding, and make sure no one creates a new pool with non-NVRAM storage. Since those things are not under anyone's control, it's not a good idea to spread these kinds of recommendations. -- Just me, Wire ... Blog: prstat.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cow performance penatly
Chad Mynhier writes: On 4/27/07, Erblichs [EMAIL PROTECTED] wrote: Ming Zhang wrote: Hi All, I wonder if anyone has an idea about the performance loss caused by COW in ZFS? If you have to read old data out before writing it to some other place, it involves a disk seek. Ming, let's take a pro example with a minimal performance tradeoff. All FSs that modify a disk block, IMO, do a full disk block read before anything. Actually, I'd say that this is the main point that needs to be made. If you're modifying data that was once on disk, that data had to be read at some point in the past. This is invariably true for any filesystem. Nits, just so readers are clear about this: the read of old data to service a write needs only be done when handling a write of a partial filesystem block (and the data is not cached, as mentioned). For a fixed-size-block database with a matching ZFS recordsize, writes will mostly be handled without a need to read previous data. Most FSes should behave the same here. With traditional filesystems, that data block is rewritten in the same place. If it were the case that disk blocks were always written immediately after being read, with no intervening I/O to cause a disk seek, COW would have no performance benefit over traditional filesystems. (Well, this isn't true, as there are other benefits to be had.) But it's rarely (if ever) the case that this happens. The modified block is generally written some time after the original block was read, with plenty of intervening I/O that leaves the disk head over some random location on the platter. So for traditional filesystems, the in-place write of a modified block will typically involve a disk seek. And a second point to be made about this is the effect of caching. With any filesystem, writes are cached in memory and flushed out to disk on a regular basis. With traditional filesystems, flushing the cache involves a set of random writes on the disk, which is possibly going to involve a disk seek for every block written. (In the best case, writes could be reordered in ascending order across the disk to minimize the disk seeks, but there would still possibly be a small disk seek between each write.) With a COW filesystem, flushing the cache involves writing sequentially to disk with no intervening disk seeks. (This assumes that there's enough free space on disk to avoid fragmentation.) In the ideal case, this means writing to disk at full platter speed. This is where the main performance benefit of COW comes from. Yep. Chad Mynhier ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync
cedric briner writes: You might set zil_disable to 1 (_then_ mount the fs to be shared). But you're still exposed to OS crashes; those would still corrupt your nfs clients. Just to better understand (I know that I'm quite slow :( ): when you say _nfs clients_, are you specifically talking of: - the nfs client program itself (lockd, statd), meaning that you can have a stale nfs handle or other things? - the host acting as an nfs client, meaning that the nfs client service works, but you would have corrupted the data that the software uses on the nfs-mounted disk? It's rather applications running on the client. Basically, we would have data loss from the perspective of applications running on the client, without any sign of errors. It's a bit like having a disk that would drop a write request and not signal an error. If I'm digging and digging into this ZIL and NFS/UFS with write cache, that's because I do not understand what kind of problems can occur. What I read in general are statements like _corruption_ of the client's point of view.. but what does that mean? Is this the scheme of what can happen: - the application on the nfs client side writes data to the nfs server - meanwhile the nfs server crashes, so: - the data are not stored - the application on the nfs client thinks that the data are stored ! :( - when the server is up again - the nfs client re-sees the data - the application on the nfs client side finds itself with data in the previous state of its last writes. Am I right? The scenario I see would be, on the client: download some software (a tar file), tar x, make. The tar succeeded with no errors at all. Behind our back during the tar x, the server rebooted. No big deal normally. But with zil_disable on the server, the make fails, either because some files from the original tar are missing, or parts of files. So with the ZIL: - The application has the ability to do things in the right way. So even after an nfs-server crash, the application on the nfs-client side can rely on its own data. So without the ZIL: - The application does not have the ability to do things in the right way. And we can have a corruption of data. But that doesn't mean corruption of the FS. It means that the data were partially written and some are missing. Sounds right. For the love of God do NOT do stuff like that. Just create ZFS on a pile of disks the way that we should, with the write cache disabled on all the disks and with redundancy in the zpool config .. nothing special : Whoa!! no.. this is really special to me !! I've read and re-read many times the: - NFS and ZFS, a fine combination - ZFS Best Practices Guide and other blogs without noticing such an idea!
I even notice the opposite recommendation from: - ZFS Best Practices Guide, ZFS Storage Pools Recommendations - http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Storage_Pools_Recommendations where I read : - For production systems, consider using whole disks for storage pools rather than slices for the following reasons: + Allow ZFS to enable the disk's write cache, for those disks that have write caches and from: - NFS and ZFS, a fine combination, Comparison with UFS - http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison where I read : Semantically correct NFS service : nfs/ufs : 17 sec (write cache disabled) nfs/zfs : 12 sec (write cache disabled, zil_disable=0) nfs/zfs : 7 sec (write cache enabled, zil_disable=0) then I can say that nfs/zfs with the write cache enabled and the zil enabled is --in that case-- faster. So why are you recommending that I disable the write cache? For ZFS, it can work either way. Maybe the above was a typo. -- Cedric BRINER Geneva - Switzerland ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync
You might set zil_disable to 1 (_then_ mount the fs to be shared). But you're still exposed to OS crashes; those would still corrupt your nfs clients. -r cedric briner writes: Hello, I wonder if the subject of this email is not self-explanatory? Okay, let's say that it is not. :) Imagine that I set up a box: - with Solaris - with many HDs (directly attached). - use ZFS as the FS - export the data with NFS - on a UPS. Then after reading the : http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations I wonder if there is a way to tell the OS to ignore the fsync flush commands since they are likely to survive a power outage. Ced. -- Cedric BRINER Geneva - Switzerland ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: storage type for ZFS
Richard L. Hamilton writes: Well, no; his quote did say software or hardware. The theory is apparently that ZFS can do better at detecting (and with redundancy, correcting) errors if it's dealing with raw hardware, or as nearly so as possible. Most SANs _can_ hand out raw LUNs as well as RAID LUNs; the folks that run them are just not used to doing it. Another issue that may come up with SANs and/or hardware RAID: supposedly, storage systems with large non-volatile caches will tend to have poor performance with ZFS, because ZFS issues cache flush commands as part of committing every transaction group; this is worse if the filesystem is also being used for NFS service. Most such hardware can be configured to ignore cache flushing commands, which is safe as long as the cache is non-volatile. The above is simply my understanding of what I've read; I could be way off base, of course. Sounds good to me. The first point is easy to understand: if you rely on ZFS for data reconstruction, carve virtual luns out of your storage and mirror those luns in ZFS, then it's possible that both copies of a mirrored block end up on a single physical device. Performance-wise, the ZFS I/O scheduler might interact in interesting ways with the one in the storage, but I don't know if this has been studied in depth. -r This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs block allocation strategy
tester writes: Hi, quoting from the zfs docs: The SPA allocates blocks in a round-robin fashion from the top-level vdevs. A storage pool with multiple top-level vdevs allows the SPA to use dynamic striping to increase disk bandwidth. Since a new block may be allocated from any of the top-level vdevs, the SPA implements dynamic striping by spreading out writes across all available top-level vdevs. Now, if I need two filesystems, /protect (mirrored - 2 physical) and /fast_unprot (striped - 3 physical), is it correct that we end up with 2 top-level vdevs? If that is the case, then from the above paragraph does it mean that blocks for either filesystem can end up on any of the 5 physical disks? What happens to the intended protection and performance? I am sure I am missing some basics here. You would probably end up with 2 pools: one with a single mirrored vdev, and one striped across 3 vdevs. -r Thanks for the clarification This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
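In other words, something along these lines (an untested sketch; device names are made up, and each pool's root filesystem mounts at /poolname by default):

  # pool 1: a mirrored pair; blocks for /protect stay within this pool
  zpool create protect mirror c0t0d0 c0t1d0
  # pool 2: a plain 3-disk stripe, no redundancy, for /fast_unprot
  zpool create fast_unprot c1t0d0 c1t1d0 c1t2d0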
Re: [zfs-discuss] query on ZFS
Annie Li writes: Can anyone help explain what does out-of-order issue mean in the following segment? ZFS has a pipelined I/O engine, similar in concept to CPU pipelines. The pipeline operates on I/O dependency graphs and provides scoreboarding, priority, deadline scheduling, out-of-order issue and I/O aggregation. I/O loads that bring other filesystems to their knees http://blogs.sun.com/roller/page/bill?entry=zfs_vs_the_benchmark are handled with ease by the ZFS I/O pipeline. Thanks, Annie ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss As an example, it says that, even if a read was issued by an application 'after' ZFS had started to work on a group of write I/Os, the read could actually issue ahead of some of the writes. -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS improvements
Gino writes: 6322646 ZFS should gracefully handle all devices failing (when writing) Which is being worked on. Using a redundant configuration prevents this from happening. What do you mean by redundant? All our servers have 2 or 4 HBAs, 2 or 4 FC switches and storage arrays with redundant controllers. We used only RAID10 zpools but we still had them corrupted.

Redundant from the viewpoint of ZFS. So either a ZFS mirror or RAID-Z. The point of the bug is to better handle failures on devices in non-redundant pools. For redundant pools, ZFS is able to self-heal problems as they arise. If you maintain redundancy only at the storage level, then it's harder for ZFS to deal with problems. We should still behave better than we do now, thus 6322646. Can you post your zpool status output ? -r

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] C'mon ARC, stay small...
Jason J. W. Williams writes: Hi Guys, Rather than starting a new thread I thought I'd continue this thread. I've been running Build 54 on a Thumper since mid-January and wanted to ask a question about the zfs_arc_max setting. We set it to 0x100000000 (4GB), however it's creeping over that till our kernel memory usage is nearly 7GB (::memstat inserted below). This is a database server so I was curious if the DNLC would have this effect over time, as it does quite quickly when dealing with small files? Would it be worth upgrading to Build 59?

Another possibility is that there is a portion of memory that might be in the kmem caches, ready to be reclaimed and returned to the OS free space. Such reclaims currently only occur on memory shortage. I think we should do it under some more conditions... This might fall under:

CR Number: 6416757 Synopsis: zfs should return memory eventually http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6416757

If you induce some temporary memory pressure, it would be nice to see if your kernel shrinks down to ~4GB. -r

Thank you in advance! Best Regards, Jason

Page Summary                Pages      MB   %Tot
Kernel                    1750044    6836    42%
Anon                      1211203    4731    29%
Exec and libs                7648      29     0%
Page cache                 220434     861     5%
Free (cachelist)           318625    1244     8%
Free (freelist)            659607    2576    16%
Total                     4167561   16279
Physical                  4078747   15932

On 3/23/07, Roch - PAE [EMAIL PROTECTED] wrote: With latest Nevada setting zfs_arc_max in /etc/system is sufficient. Playing with mdb on a live system is more tricky and is what caused the problem here. -r

[EMAIL PROTECTED] writes: Jim Mauro wrote: All righty...I set c_max to 512MB, c to 512MB, and p to 256MB...

arc::print -tad
{
    ...
    c02e29e8 uint64_t size = 0t299008
    c02e29f0 uint64_t p = 0t16588228608
    c02e29f8 uint64_t c = 0t33176457216
    c02e2a00 uint64_t c_min = 0t1070318720
    c02e2a08 uint64_t c_max = 0t33176457216
    ...
}
c02e2a08 /Z 0x20000000
arc+0x48: 0x7b9789000 = 0x20000000
c02e29f8 /Z 0x20000000
arc+0x38: 0x7b9789000 = 0x20000000
c02e29f0 /Z 0x10000000
arc+0x30: 0x3dcbc4800 = 0x10000000
arc::print -tad
{
    ...
    c02e29e8 uint64_t size = 0t299008
    c02e29f0 uint64_t p = 0t268435456 -- p is 256MB
    c02e29f8 uint64_t c = 0t536870912 -- c is 512MB
    c02e2a00 uint64_t c_min = 0t1070318720
    c02e2a08 uint64_t c_max = 0t536870912 --- c_max is 512MB
    ...
}

After a few runs of the workload ...

arc::print -d size
size = 0t536788992

Ah - looks like we're out of the woods. The ARC remains clamped at 512MB. Is there a way to set these fields using /etc/system? Or does this require a new or modified init script to run and do the above with each boot? Darren

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
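For the archives, the /etc/system form Roch refers to looks like this on builds where the tunable is honored (the value shown is the 4GB from this thread):

set zfs:zfs_arc_max = 0x100000000

and, after a reboot, the clamp can be checked with the same mdb incantation used above:

# echo "arc::print -d c_max" | mdb -k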
Re: Re[2]: [zfs-discuss] 6410 expansion shelf
Robert Milkowski writes: Hello Selim, Wednesday, March 28, 2007, 5:45:42 AM, you wrote:
SD talking of which,
SD what's the effort and consequences to increase the max allowed block
SD size in zfs to higher figures like 1M...
Think what would happen then if you try to read 100KB of data - due to checksumming ZFS would have to read the entire MB. However it should be possible to batch several IOs together and issue one larger one with ZFS - at least I hope it's possible.

As you note, the max coherency unit (blocksize) in ZFS is 128K. It's also the max I/O size. And smaller I/Os are already aggregated or batched up to that size. At 128K the control-to-data ratio on the wire is already quite reasonable. So I don't see much benefit to increasing this (there may be some but the context needs to be well defined). The issue is subject to debate because traditionally, one I/O came with an implied overhead of a full head seek. In that case, the larger the I/O the better. So at 60MB/s throughput and 5ms head seek time, we need I/Os > 300K to make the data transfer time larger than the seek time and ~3MB I/O sizes to reach the point of diminishing returns. But with a write-allocate scheme we are not hit with the head seek for every I/O and common I/O size wisdom needs to be reconsidered. -r

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
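Spelling out the arithmetic Roch uses (same assumed numbers: 60MB/s media throughput, 5ms average seek): a 300KB transfer takes 300KB / 60MB/s = 5ms, equal to the seek, so half of each I/O is still seek overhead; at ~3MB the transfer takes ~50ms and the seek is down to roughly 9% of the total, which is about where larger I/Os stop paying off.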
Re: [zfs-discuss] ZFS and Kstats
See Kernel Statistics Library Functions kstat(3KSTAT) -r Atul Vidwansa writes: Peter, How do I get those stats programatically? Any clues? Regards, _Atul ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
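A minimal sketch of what that looks like with libkstat, using the ZFS ARC kstat as the example (the arcstats kstat only exists on builds where it has been integrated; error handling trimmed):

#include <kstat.h>
#include <stdio.h>

int
main(void)
{
        kstat_ctl_t *kc = kstat_open();
        kstat_t *ksp;
        kstat_named_t *kn;

        if (kc == NULL)
                return (1);
        /* module "zfs", instance 0, name "arcstats" */
        ksp = kstat_lookup(kc, "zfs", 0, "arcstats");
        if (ksp == NULL || kstat_read(kc, ksp, NULL) == -1)
                return (1);
        kn = kstat_data_lookup(ksp, "size");      /* current ARC size */
        if (kn != NULL)
                printf("ARC size: %llu bytes\n",
                    (u_longlong_t)kn->value.ui64);
        (void) kstat_close(kc);
        return (0);
}

Compile with something like: cc -o arcsize arcsize.c -lkstat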
Re: [zfs-discuss] missing features?Could/should zfs support a new ioctl, constrained if neede
Richard L. Hamilton writes: _FIOSATIME - why doesn't zfs support this (assuming I didn't just miss it)? Might be handy for backups.

Are these syscalls sufficient?
int utimes(const char *path, const struct timeval times[2]);
int futimesat(int fildes, const char *path, const struct timeval times[2]);

Could/should zfs support a new ioctl, constrained if needed to files of zero size, that sets an explicit (and fixed) blocksize for a particular file? That might be useful for performance in special cases when one didn't necessarily want to specify (or depend on the specification of, perhaps) the attribute at the filesystem level. One could imagine a database that was itself tunable per-file to a similar range of blocksizes, which would almost certainly benefit if it used those sizes for the corresponding files. Additional capabilities that might be desirable: setting the blocksize to zero to let the system return to default behavior for a file; being able to discover the file's blocksize (does fstat() report this?) as well as whether it was fixed at the filesystem level, at the file level, or in default state.

Yep, it does look interesting.

Wasn't there some work going on to add real per-user (and maybe per-group) quotas, so one doesn't necessarily need to be sharing or automounting thousands of individual filesystems (slow)? Haven't heard anything lately though...

What is slow here is mounting all those FS at boot and unmounting at shutdown. The most relevant project here in my mind is: 6478980 zfs should support automount property which would give ZFS a mount-on-demand behavior. Fast boot/shutdown and fewer mounted FS at any one time. Then we need to make administering many user FS as painless as administering a single one. -r

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
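For context, the filesystem-level attribute being referred to is the recordsize property; there is no per-file control today, so the usual workaround is to give files that need a particular blocksize a filesystem of their own (dataset name hypothetical). The setting only applies to files created after the change:

# zfs set recordsize=8K pool/dbfiles
# zfs get recordsize pool/dbfiles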
Re: [zfs-discuss] Re: ZFS performance with Oracle
JS writes: The big problem is that if you don't do your redundancy in the zpool, then the loss of a single device flatlines the system. This occurs in single-device pools or stripes or concats. Sun support has said in support calls and Sunsolve docs that this is by design, but I've never seen the loss of any other filesystem cause a machine to halt and dump core. Multiple bus resets can create a condition that makes the kernel believe that the device is no longer available. This was a persistent problem, especially on Pillar, until I started setting sd_max_throttle down.

Such failures are certainly not by design and my understanding is that it's being very actively worked on. This said, redundancy in the zpool is a great idea. At the least it protects the path between the filesystem and the storage. -r
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
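For reference, sd_max_throttle is usually lowered in /etc/system; the value below is only an example (the value JS used isn't stated), and on SPARC FC configurations the equivalent ssd tunable applies instead:

set sd:sd_max_throttle = 20
set ssd:ssd_max_throttle = 20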
Re: [zfs-discuss] Re: Re: Re: ZFS memory and swap usage
Hi Mike, This is already integrated in Nevada: 6510807 ARC statistics should be exported via kstat

kstat zfs:0:arcstats
module: zfs                             instance: 0
name:   arcstats                        class:    misc
        c                               534457344
        c_max                           16028893184
        c_min                           534457344
        crtime                          6301.4284957
        deleted                         1149800
        demand_data_hits                4514722
        demand_data_misses              54810
        demand_metadata_hits            289342
        demand_metadata_misses          5203
        evict_skip                      0
        hash_chain_max                  8
        hash_chains                     8192
        hash_collisions                 1243605
        hash_elements                   53250
        hash_elements_max               250443
        hits                            9929297
        mfu_ghost_hits                  3917
        mfu_hits                        2496914
        misses                          60013
        mru_ghost_hits                  29072
        mru_hits                        2596064
        mutex_miss                      4791
        p                               210483584
        prefetch_data_hits              5125227
        prefetch_data_misses            0
        prefetch_metadata_hits          6
        prefetch_metadata_misses        0
        recycle_miss                    2338
        size                            439890944
        snaptime                        939404.5920782

-r
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
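If you just want to watch one of these from the shell rather than through libkstat, kstat(1M) can sample a single statistic at an interval, e.g. the ARC size every 5 seconds:

# kstat -p zfs:0:arcstats:size 5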
Re: [zfs-discuss] Re: Re: ZFS memory and swap usage
Info on tuning the ARC was just recently updated: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Memory_and_Dynamic_Reconfiguration_Recommendations -r Rainer Heilke writes: Thanks for the feedback. Please see below. ZFS should give back memory used for cache to system if applications are demanding it. Right it should but sometimes it won't. However with databases there's simple workaround - as you know how much ram all databases will consume at least you can limit ZFS's arc cache to remaining free memory (and possibly reduce it even more byt 2-3x factor). For details on how to do it see 'C'mon ARC, stay small...' thread here. So if you have 16GB RAM in a system and want 10GB for SGA + another 2GB for Oracle + 1GB for other kernel resources you are with 3GB left. So I would limit arc c_max to 3GB or even to 1GB. I was of the understanding that this kernel setting was only introduced in newer Nevada builds. Does this actually work under Solaris 10, Update 3? Thanks again. Rainer This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: ZFS memory and swap usage
Rainer Heilke writes: The updated information states that the kernel setting is only for the current Nevada build. We are not going to use the kernel debugger method to change the setting on a live production system (and do this everytime we need to reboot). We're back to trying to set their expectations more realistically, and using proper tools to measure memory usage. As I stated at the outset, they are trying to start up a 10GB SGA database within two minutes to simulate the start-up of five 2GB databases at boot-up. I sincerely doubt they are going to start all five databases simultaneously within two minutes on a regular boot-up. After bootup, ZFS should have near zero memory in the ARC. Limiting the ARC should have no effect on their startup times. Right ? -r So, what is the best use of the OS tools (vmstat, etc.) to show them how this would really occur? Rainer This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS stalling problem
Working with a small txg_time means we are hit by the pool sync overhead more often. This is why the per second throughpuot has smaller peak values. With txg_time = 5, we have another problem which is that depending on timing of the pool sync, some txg can end up with too little data in them and sync quickly. We're closing in (I hope) on fixing both issues: 6429205 each zpool needs to monitor its throughput and throttle heavy writers 6415647 Sequential writing is jumping http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id= -r Jesse DeFer writes: OK, I tried it with txg_time set to 1 and am seeing less predictable results. The first time I ran the test it completed in 27 seconds (vs 24s for ufs or 42s with txg_time=5). Further tests ran from 27s to 43s, about half the time greater than 40s. zpool iostat doesn't show the large no-writes gaps, but it is still very bursty and peak bandwidth is lower. Here is a 29s run: tank 113K 464G 0 0 0 0 tank 113K 464G 0226 0 28.2M tank40.1M 464G 0441 0 46.9M tank88.2M 464G 0384 0 39.8M tank 136M 464G 0445 0 47.4M tank 184M 464G 0412 0 43.4M tank 232M 464G 0411 0 43.2M tank 272M 464G 0402 0 42.1M tank 320M 464G 0435 0 46.3M tank 368M 464G 0366 63.4K 37.7M tank 408M 464G 0494 0 53.6M tank 456M 464G 0360 0 36.8M tank 496M 464G 0420 0 44.5M tank 544M 463G 0439 0 46.8M tank 585M 463G 0370 0 38.2M tank 633M 463G 0407 0 42.6M tank 673M 463G 0457 0 49.0M tank 713M 463G 0368 0 37.9M tank 761M 463G 0443 0 47.2M tank 801M 463G 0380 63.4K 39.4M tank 844M 463G 0444 63.4K 47.4M tank 879M 463G 0184 0 14.9M tank 879M 463G 0339 0 33.4M tank 913M 463G 0215 0 26.5M tank 944M 463G 0393 63.4K 36.4M tank 976M 463G 0171 63.4K 10.5M tank1008M 463G 0237 63.4K 21.6M tank1008M 463G 0312 0 31.5M tank1.02G 463G 0137 0 9.05M tank1.05G 463G 0313 0 23.4M tank1.05G 463G 0 0 0 0 Jesse Jesse, This isn't a stall -- it's just the natural rhythm of pushing out transaction groups. ZFS collects work (transactions) until either the transaction group is full (measured in terms of how much memory the system has), or five seconds elapse -- whichever comes first. Your data would seem to suggest that the read side isn't delivering data as fast as ZFS can write it. However, it's possible that there's some sort of 'breathing' effect that's hurting performance. One simple experiment you could try: patch txg_time to 1. That will cause ZFS to push transaction groups every second instead of the default of every 5 seconds. If this helps (or if it doesn't), please let us know. Thanks, Jeff Jesse DeFer wrote: Hello, I am having problems with ZFS stalling when writing, any help in troubleshooting would be appreciated. Every 5 seconds or so the write bandwidth drops to zero, then picks up a few seconds later (see the zpool iostat at the bottom of this message). I am running SXDE, snv_55b. My test consists of copying a 1gb file (with cp) between two drives, one 80GB PATA, one 500GB SATA. The first drive is the system drive (UFS), the second is for data. I have configured the data drive with UFS and it does not exhibit the stalling problem and it runs in almost half the time. I have tried many different ZFS settings as well: atime=off, compression=off, checksums=off, zil_disable=1 all to no effect. CPU jumps to about 25% system time during the stalls, and hovers around 5% when data is being transferred. 
# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         183M   464G      0     17  1.12K  1.93M
tank         183M   464G      0    457      0  57.2M
tank         183M   464G      0    445      0  55.7M
tank         183M   464G      0    405      0  50.7M
tank         366M   464G      0    226      0  4.97M
tank         366M   464G      0      0      0      0
tank         366M   464G      0      0      0      0
tank
Re: Re[2]: [zfs-discuss] writes lost with zfs !
Did you run touch from a client ? ZFS and UFS are different in general but in response to a local touch command neither needs to generate immediate I/O, and in response to a client touch both do. -r

Ayaz Anjum writes: HI ! Well as per my actual post, I created a zfs file as part of Sun Cluster HAStoragePlus, and then disconnected the FC cable; since there was no active IO the failure of the disk was not detected, then I touched a file in the zfs filesystem, and it went fine; only after that, when I did sync, the node panicked and the zfs filesystem failed over to the other node. On the other node the file I touched is not there in the same zfs filesystem, hence I am saying that data is lost. I am planning to deploy zfs in a production NFS environment with over 2TB of data where users are constantly updating files. Hence my concerns about data integrity. Please explain. thanks Ayaz Anjum

Darren Dunham [EMAIL PROTECTED] Sent by: [EMAIL PROTECTED] 03/12/2007 05:45 AM To zfs-discuss@opensolaris.org cc Subject Re: Re[2]: [zfs-discuss] writes lost with zfs ! I have some concerns here, from my experience in the past, touching a file ( doing some IO ) will cause the ufs filesystem to failover, unlike zfs where it did not ! Why is the behaviour of zfs different than ufs ?

UFS always does synchronous metadata updates. So a 'touch' that creates a file is going to require a metadata write. ZFS writes may not necessarily hit the disk until a transaction group flush.

is not this compromising data integrity ?

It should not. Is there a scenario that you are worried about? -- Darren Dunham [EMAIL PROTECTED] Senior Technical Consultant TAOS http://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area This line left intentionally blank to confuse you.
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- Confidentiality Notice : This e-mail and any attachments are confidential to the addressee and may also be privileged. If you are not the addressee of this e-mail, you may not copy, forward, disclose or otherwise use it in any way whatsoever. If you have received this e-mail by mistake, please e-mail the sender by replying to this message, and delete the original and any print out thereof.
Re: [zfs-discuss] Re: ZFS/UFS layout for 4 disk servers
Frank Cusack writes: On March 7, 2007 8:50:53 AM -0800 Matt B [EMAIL PROTECTED] wrote: Any thoughts on the best practice points I am raising? It disturbs me that it would make a statement like don't use slices for production. I think that's just a performance thing. Right, I think what would be very unoptimal from ZFS standpoint would be to configure 2 slices from _one_ disk into a given zpool. This would send the I/O scheduler on a tangent, but it would nevertheless still work. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS/UFS layout for 4 disk servers
Manoj Joseph writes: Matt B wrote: Any thoughts on the best practice points I am raising? It disturbs me that it would make a statement like don't use slices for production. ZFS turns on write cache on the disk if you give it the entire disk to manage. It is good for performance. So, you should use whole disks when ever possible. Just a small clarification to state that the extra performance that comes from having the write cache on applies mostly to disks that do not have other means of command concurrency (NCQ, CTQ). With NCQ/CTQ, the write cache setting should not matter much to ZFS performance. -r Slices work too, but write cache for the disk will not be turned on by zfs. Cheers Manoj ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
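To make the whole-disk point concrete (device names hypothetical): given the whole disk, ZFS writes an EFI label and can turn the drive's write cache on; given only a slice, it leaves the cache setting alone:

# zpool create tank c1t0d0        (whole disk)
# zpool create tank c1t0d0s0      (slice; write cache not enabled by ZFS)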
Re: [zfs-discuss] Re: ZFS stalling problem
Jesse, You can change txg_time with mdb echo txg_time/W0t1 | mdb -kw -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
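To check the current value first (txg_time is the same kernel variable being patched above):

# echo 'txg_time/D' | mdb -k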
Re: [zfs-discuss] Why number of NFS threads jumps to the max value?
Leon Koll writes: On 2/28/07, Roch - PAE [EMAIL PROTECTED] wrote: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6467988 NFSD threads are created on a demand spike (all of them waiting on I/O) but thentend to stick around servicing moderate loads. -r Hello Roch, It's not my case. NFS stops to service after some point. And the reason is in ZFS. It never happens with NFS/UFS. Shortly, my scenario: 1st SFS run, 2000 requested IOPS. NFS is fine, ;low number of threads. 2st SFS run, 4000 requested IOPS. NFS cannot serve all requests, no of threads jumps to max 3rd SFS run, 2000 requested IOPS. NFS cannot serve all requests, no of threads jumps to max. System cannot get back to the same results under equal load (1st and 3rd). Reboot between 2nd and 3rd doesn't help. The only persistent thing is a directory structure that was created during the 2nd run (in SFS higher requested load - more directories/files created). I am sure it's a bug. I need help. I don't care that ZFS works N times worse than UFS. I really care that after heavy load everything is totally screwed. Thanks, -- Leon Hi Leon, How much is the slowdown between 1st and 3rd ? How filled is the pool at each stage ? What does 'NFS stops to service' mean ? -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why number of NFS threads jumps to the max value?
Leon Koll writes: On 3/5/07, Roch - PAE [EMAIL PROTECTED] wrote: Leon Koll writes: On 2/28/07, Roch - PAE [EMAIL PROTECTED] wrote: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6467988 NFSD threads are created on a demand spike (all of them waiting on I/O) but thentend to stick around servicing moderate loads. -r Hello Roch, It's not my case. NFS stops to service after some point. And the reason is in ZFS. It never happens with NFS/UFS. Shortly, my scenario: 1st SFS run, 2000 requested IOPS. NFS is fine, ;low number of threads. 2st SFS run, 4000 requested IOPS. NFS cannot serve all requests, no of threads jumps to max 3rd SFS run, 2000 requested IOPS. NFS cannot serve all requests, no of threads jumps to max. System cannot get back to the same results under equal load (1st and 3rd). Reboot between 2nd and 3rd doesn't help. The only persistent thing is a directory structure that was created during the 2nd run (in SFS higher requested load - more directories/files created). I am sure it's a bug. I need help. I don't care that ZFS works N times worse than UFS. I really care that after heavy load everything is totally screwed. Thanks, -- Leon Hi Leon, How much is the slowdown between 1st and 3rd ? How filled is Typical case is: 1st: 1996 IOPS, latency 2.7 3rd: 1375 IOPS, latency 37.9 The large latency increase is the side effect of requesting more than what can be delivered. Queue builds up and latency follow. So it should not be the primary focus IMO. The Decrease in IOPS is the primary problem. One hypothesis is that over the life of the FS we're moving toward spreading access to the full disk platter. We can imagine some fragmentation hitting as well. I'm not sure how I'd test both hypothesis. the pool at each stage ? What does 'NFS stops to service' mean ? There is a lot of error messages on the NFS(SFS) client : sfs352: too many failed RPC calls - 416 good 27 bad sfs3132: too many failed RPC calls - 302 good 27 bad sfs3109: too many failed RPC calls - 533 good 31 bad sfs353: too many failed RPC calls - 301 good 28 bad sfs3144: too many failed RPC calls - 305 good 25 bad sfs3121: too many failed RPC calls - 311 good 30 bad sfs370: too many failed RPC calls - 315 good 27 bad Can this be timing out or queue full drops ? Might be a side effect of SFS requesting more than what can be delivered. Thanks, -- Leon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why number of NFS threads jumps to the max value?
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6467988 NFSD threads are created on a demand spike (all of them waiting on I/O) but thentend to stick around servicing moderate loads. -r Leon Koll wrote: Hello, gurus I need your help. During the benchmark test of NFS-shared ZFS file systems at some moment the number of NFS threads jumps to the maximal value, 1027 (NFSD_SERVERS was set to 1024). The latency also grows and the number of IOPS is going down. I've collected the output of echo ::pgrep nfsd | ::walk thread | ::findstack -v | mdb -k that can be seen here: http://tinyurl.com/yrvn4z Could you please look at it and tell me what's wrong with my NFS server. Appreciate, -- Leon This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks
Jeff Davis writes: On February 26, 2007 9:05:21 AM -0800 Jeff Davis wrote: But you have to be aware that logically sequential reads do not necessarily translate into physically sequential reads with zfs.

I understand that the COW design can fragment files. I'm still trying to understand how that would affect a database. It seems like that may be bad for performance on single disks due to the seeking, but I would expect that to be less significant when you have many spindles. I've read the following blogs regarding the topic, but didn't find a lot of details: http://blogs.sun.com/bonwick/entry/zfs_block_allocation http://blogs.sun.com/realneel/entry/zfs_and_databases

Here is my take on this: DB updates (writes) are mostly governed by the synchronous write code path, which for ZFS means the ZIL performance. It's already quite good in that it aggregates multiple updates into few I/Os. Some further improvements are in the works. COW, in general, greatly simplifies the write code path. DB reads in transactional workloads are mostly random. If the DB is not cacheable the performance will be that of a head seek no matter what FS is used (since we can't guess in advance where to seek, the COW nature neither helps nor hinders performance). DB reads in decision-support workloads can benefit from good prefetching (since here we actually know where the next seeks will be). -r

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Efficiency when reading the same file blocks
Frank Hofmann writes: On Tue, 27 Feb 2007, Jeff Davis wrote: Given your question are you about to come back with a case where you are not seeing this? As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O rate drops off quickly when you add processes while reading the same blocks from the same file at the same time. I don't know why this is, and it would be helpful if someone explained it to me. UFS readahead isn't MT-aware - it starts trashing when multiple threads perform reads of the same blocks. UFS readahead only works if it's a single thread per file, as the readahead state, i_nextr, is per-inode (and not a per-thread) state. Multiple concurrent readers trash this for each other, as there's only one-per-file. To qualify 'trashing', this means UFS looses tracks of the access, considers workload as random and so does not do any readahead. ZFS did a lot better. There did not appear to be any drop-off after the first process. There was a drop in I/O rate as I kept adding processes, but in that case the CPU was at 100%. I haven't had a chance to test this on a bigger box, but I suspect ZFS is able to keep the sequential read going at full speed (at least if the blocks happen to be written sequentially). ZFS caches multiple readahead states - see the leading comment in usr/src/uts/common/fs/zfs/vdev_cache.c in your favourite workspace. The vdev_cache is where you have the low level device level prefetch (I/O for 8K, read 64K of whatever happens to be under the disk head). dmu_zfetch.c is where the logical prefetching occurs. -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] understanding zfs/thunoer bottlenecks?
Jens Elkner writes: Currently I'm trying to figure out the best zfs layout for a thumper wrt. read AND write performance. I did some simple mkfile 512G tests and found out that on average ~500 MB/s seems to be the maximum one can reach (tried initial default setup, all 46 HDDs as R0, etc.).

That might be a per-pool limitation due to http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6460622 This performance feature was fixed in Nevada last week. Workaround is to create multiple pools with fewer disks. Also this http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6415647 is degrading the perf a bit (guesstimate of anywhere up to 10-20%). -r

According to http://www.amd.com/us-en/assets/content_type/DownloadableAssets/ArchitectureWP_062806.pdf I would assume that much more, at least in theory a max of ~2.5 GB/s, should be possible with R0 (assuming the throughput of a single thumper HDD is ~54 MB/s)... Is somebody able to enlighten me? Thanx, jel.

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
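A sketch of the multiple-pools workaround, for builds predating the 6460622 fix (device names hypothetical; the idea is simply several smaller pools instead of one 46-disk pool, with the workload spread across them):

# zpool create tank1 raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0
# zpool create tank2 raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0
# zpool create tank3 raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0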
Re: [zfs-discuss] Re: Perforce on ZFS
So Jonathan, you have a concern about the on-disk space efficiency for small files (more or less subsector). It is a problem that we can throw rust at. I am not sure if this is the basis of Claude's concern though.

Creating small files: last week I did a small test. With ZFS I can create 4600 files _and_ sync up the pool to disk and saw no more than 500 I/Os. I'm no FS expert but this looks absolutely amazing to me (ok, I'm rather enthusiastic in general). Logging UFS needs 1 I/O per file (so ~10X more for my test). I don't know where other filesystems are on that metric.

I also pointed out that ZFS is not too CPU efficient at tiny write(2) syscalls. But this inefficiency goes away around 8K writes. This here is a CPU benchmark (I/O is a non-factor):

CHUNK    ZFS vs UFS
1B       4X slower
1K       2X slower
8K       25% slower
32K      equal
64K      30% faster

Waiting for a more specific problem statement, I can only stick to what I said: I know of no small file problems with ZFS; if there is one, I'd just like to see the data. -r
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Perforce on ZFS
Sorry to insist but I am not aware of a small file problem with ZFS (which doesn't mean there isn't one, nor that we agree on the definition of 'problem'). So if anyone has data on this topic, I'm interested. Also note, ZFS does a lot more than VxFS. -r

Claude Teissedre writes: Hello Roch, Thanks for your reply. According to Iozone and Filebench (http://blogs.sun.com/dom/), ZFS is less performant than VxFS for small files and more performant for large files. In your blog, I don't see specific infos related to small files - but it's a very interesting blog. Any help from CC: people related to Perforce benchmark (not in techtracker) is welcome. Thanks, Claude

Roch - PAE wrote: Hi Claude. For this kind of query, try zfs-discuss@opensolaris.org; Looks like a common workload to me. I know of no small file problem with ZFS. You might want to state your metric of success ? -r

Claude Teissedre writes: Hello, I am looking for any benchmark of Perforce on ZFS. My need here is specifically for Perforce, a source manager. At my ISV, it handles 250 users simultaneously (15 instances on average) and 16 million (small) files. That's an area not covered in the benchmarks I have seen. Thanks, Claude
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!
Leon Koll writes: An update: Not sure if it is related to the fragmentation, but I can say that the serious performance degradation in my NFS/ZFS benchmarks is a result of on-disk ZFS data layout. Read operations on directories (NFS3 readdirplus) are abnormally time-consuming. That kills the server. After a cold restart of the host the performance is still on the floor. My conclusion: it's not CPU, not memory, it's ZFS on-disk structures.

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

As I understand the issue, a readdirplus is 2X slower when data is already cached in the client than when it is not. Given that the on-disk structure does not change between the 2 runs, I can't really place the fault on it. -r
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is ZFS file system supports short writes ?
dudekula mastan writes: If a write call attempted to write X bytes of data, and the write call writes only x (where x < X) bytes, then we call that write a short write. -Masthan

What kind of support do you want/need ? -r
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Not about Implementing fbarrier() on ZFS
Erblichs writes: Jeff Bonwick, Do you agree that their is a major tradeoff of builds up a wad of transactions in memory? We loose the changes if we have an unstable environment. Thus, I don't quite understand why a 2-phase approach to commits isn't done. First, take the transactions as they come and do a minimal amount of a delayed write. If the number of transactions build up, then convert to the delayed write scheme. I probably don't understand the proposition. It seems that this is about making all writes synchronous and initially go through the Zil and then convert to the pool sync when load builds up ? The problem is that if we make all writes go through the synchronous Zil, this will limit the load greatly in a way that we'll never build a backlog (unless we scale to 1000s of threads). So is this about an option to enable O_DSYNC for all files ? This assumption is that not all ZFS envs are write heavy versus write once and read-many type accesses. My assumption is that attribute/meta reading outweighs all other accesses. Wouldn't this approach allow minimal outstanding transactions and favor read access. Yes, the assumption is that once the wad is started, the amount of writing could be substantial and thus the amount of available bandwidth for reading is reduced. This would then allow for a more N states to be available. Right? So the reads _are_ prioritized over pool writes by the I/O scheduler. But it is correct that the pool sync does impact the read latency atleast on JBOD. There already are suggestions on reducing the impact (reserved read slots, and throttlingwriters,...). Also for the next build the overhead of the pool sync is reduced which opens up the possibility of testing with smaller txg_time. I would be interested to know the problems you have observed to see if we're covered. Second, their are a multiple uses of then: (then pushes, then flushes all disk..., then writes the new uberblock, then flushes the caches again), in which seems to have some level of possible parallelism which should reduce the latency from the start to the final write. Or did you just say that for simplicity sake? The parallelism level of those operations seems very high to me and it was improved last week (for the tail end of the pool sync). But note that the pool sync does not commonly hold up a write or a zil commit. It does so only when the storage is saturated for 10s of seconds. Given that memory is finite we have to throttle applications at some point. -r Mitchell Erblich --- Jeff Bonwick wrote: Toby Thain wrote: I'm no guru, but would not ZFS already require strict ordering for its transactions ... which property Peter was exploiting to get fbarrier() for free? Exactly. Even if you disable the intent log, the transactional nature of ZFS ensures preservation of event ordering. Note that disk caches don't come into it: ZFS builds up a wad of transactions in memory, then pushes them out as a transaction group. That entire group will either commit or not. ZFS writes all the new data to new locations, then flushes all disk write caches, then writes the new uberblock, then flushes the caches again. Thus you can lose power at any point in the middle of committing transaction group N, and you're guaranteed that upon reboot, everything will either be at state N or state N-1. I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool thing is that on ZFS, fbarrier() is a no-op. It's implicit after every system call. 
Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Peter Schuller writes: I agree about the usefulness of fbarrier() vs. fsync(), BTW. The cool thing is that on ZFS, fbarrier() is a no-op. It's implicit after every system call. That is interesting. Could this account for disproportionate kernel CPU usage for applications that perform I/O one byte at a time, as compared to other filesystems? (Nevermind that the application shouldn't do that to begin with.)

I just quickly measured this (overwriting files in CHUNKS); this is a software benchmark (I/O is a non-factor):

CHUNK    ZFS vs UFS
1B       4X slower
1K       2X slower
8K       25% slower
32K      equal
64K      30% faster

Quick and dirty but I think it paints a picture. I can't really answer your question though. -r
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
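A rough way to reproduce that kind of comparison yourself (paths are hypothetical; ptime reports the CPU time, so for cached overwrites of a small file the I/O side stays largely out of the picture). Write the same 1MB with 1-byte and then 8K chunks, on both a ZFS and a UFS mount point:

# ptime dd if=/dev/zero of=/tank/fs/junk bs=1 count=1048576
# ptime dd if=/dev/zero of=/tank/fs/junk bs=8k count=128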
Re: [zfs-discuss] Re: Re: ZFS vs NFS vs array caches, revisited
The only obvious thing would be if the exported ZFS filesystems were initially mounted at a point in time when zil_disable was non-null. The stack trace that is relevant is:

sd_send_scsi_SYNCHRONIZE_CACHE
sd`sdioctl+0x1770
zfs`vdev_disk_io_start+0xa0
zfs`zil_flush_vdevs+0x108
zfs`zil_commit_writer+0x2b8
...

You might want to try in turn:

dtrace -n 'sd_send_scsi_SYNCHRONIZE_CACHE:entry{@[stack(20)]=count()}'
dtrace -n 'sdioctl:entry{@[stack(20)]=count()}'
dtrace -n 'zil_flush_vdevs:entry{@[stack(20)]=count()}'
dtrace -n 'zil_commit_writer:entry{@[stack(20)]=count()}'

And see if you lose your footing along the way. -r

Marion Hakanson writes: [EMAIL PROTECTED] said: How the ZFS striped on 7 slices of FC-SATA LUN via NFS worked 146 times faster than the ZFS on 1 slice of the same LUN via NFS???

Well, I do have more info to share on this issue, though how it worked faster in that test still remains a mystery. Folks may recall that I said: Not that I'm not complaining, mind you. I appear to have stumbled across a way to get NFS over ZFS to work at a reasonable speed, without making changes to the array (nor resorting to giving ZFS SVM soft partitions instead of real devices). Suboptimal, mind you, but it's workable if our Hitachi folks don't turn up a way to tweak the array.

Unfortunately, I was wrong. I _don't_ know how to make it go fast. While I _have_ been able to reproduce the result on a couple different LUN/slice configurations, I don't know what triggers the fast behavior. All I can say for sure is that a little dtrace one-liner that counts sync-cache calls turns up no such calls (for both local ZFS and remote NFS extracts) when things are going fast on a particular filesystem. By comparison, a local ZFS tar-extraction triggers 12 sync-cache calls, and one hits 288 such calls during an NFS extraction before interrupting the run after 30 seconds (est. 1/100th of the way through) when things are working in the slow mode. Oh yeah, here's the one-liner (type in the command, run your test in another session, then hit ^C on this one):

dtrace -n fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry'{@ = count()}'

This is my first ever use of dtrace, so please be gentle with me (:-).

[EMAIL PROTECTED] said: Guess I should go read the ZFS source code (though my 10U3 surely lags the Opensolaris stuff).

I did go read the source code, for my own edification. To reiterate what was said earlier:

[EMAIL PROTECTED] said: The point is that the flushes occur whether or not ZFS turned the caches on (caches might be turned on by some other means outside the visibility of ZFS).

My limited reading of ZFS (on opensolaris.org site) code so far has turned up no obvious way to make ZFS skip the sync-cache call. However my dtrace test, unless it's flawed, shows that on some filesystems the call is made, and on other filesystems the call is not made.

[EMAIL PROTECTED] said: 2. I never saw the storage controller with cache-per-LUN setting. Cache size doesn't depend on number of LUNs IMHO, it's a fixed size per controller or per FC port, SAN-experts-please-fix-me-if-I'm-wrong.

Robert has already mentioned array cache being reserved on a per-LUN basis in Symmetrix boxes. Our low-end HDS unit also has cache pre-fetch settings on a per-LUN basis (defaults according to number of disks in RAID-group).
Regards, Marion ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: ZFS vs NFS vs array caches, revisited
On x86 try with sd_send_scsi_SYNCHRONIZE_CACHE

Leon Koll writes: Hi Marion, your one-liner works only on SPARC and doesn't work on x86:

# dtrace -n fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry'{@ = count()}'
dtrace: invalid probe specifier fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry{@ = count()}: probe description fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry does not match any probes

What's wrong with it? Thanks, -- leon

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
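A variant that sidesteps the sd/ssd naming difference (ssd on SPARC FC, sd on x86) by globbing the function name, so the same one-liner should work on either machine:

# dtrace -n 'fbt::*sd_send_scsi_SYNCHRONIZE_CACHE:entry{@ = count()}'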
Re: [zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Robert Milkowski writes:

bash-3.00# dtrace -n fbt::txg_quiesce:return'{printf("%Y ",walltimestamp);}'
dtrace: description 'fbt::txg_quiesce:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  3  38168           txg_quiesce:return 2007 Feb 12 14:08:15
  0  38168           txg_quiesce:return 2007 Feb 12 14:12:14
  3  38168           txg_quiesce:return 2007 Feb 12 14:15:05
^C

Why do I not see it exactly every 5s? On another server I get output exactly every 5s.

I am not sure about this specific function but if the question is the same as why is the pool syncing more often than every 5 sec, then that can be because of a low-memory condition (if we have too much dirty memory to sync we don't wait the 5 seconds). See arc_tempreserve_space around (ERESTART). -r

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?
Duh!. Long sync (which delays the next sync) are also possible on a write intensive workloads. Throttling heavy writters, I think, is the key to fixing this. Robert Milkowski writes: Hello Roch, Monday, February 12, 2007, 3:19:23 PM, you wrote: RP Robert Milkowski writes: bash-3.00# dtrace -n fbt::txg_quiesce:return'{printf(%Y ,walltimestamp);}' dtrace: description 'fbt::txg_quiesce:return' matched 1 probe CPU IDFUNCTION:NAME 3 38168 txg_quiesce:return 2007 Feb 12 14:08:15 0 38168 txg_quiesce:return 2007 Feb 12 14:12:14 3 38168 txg_quiesce:return 2007 Feb 12 14:15:05 ^C Why I do not see it exactly every 5s? On other server I get output exactly every 5s. RP I am not sure about this specific funtion but if the RP question is the same as why is the pool synching more often RP than 5sec, then that can be because of low memory condition RP (if we have too much dirty memory to sync we don't wait the RP 5 seconds.). See arc_tempreserve_space around (ERESTART). The opposite - why it's not syncing every 5s and rather every few minutes on that server. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] RealNeel : ZFS and DB performance
It's just a matter of time before ZFS overtakes UFS/DIO for DB loads, See Neel's new blog entry: http://blogs.sun.com/realneel/entry/zfs_and_databases_time_for -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] se3510 and ZFS
Robert Milkowski writes: Hello Jonathan, Tuesday, February 6, 2007, 5:00:07 PM, you wrote: JE On Feb 6, 2007, at 06:55, Robert Milkowski wrote: Hello zfs-discuss, It looks like when zfs issues write cache flush commands se3510 actually honors it. I do not have right now spare se3510 to be 100% sure but comparing nfs/zfs server with se3510 to another nfs/ufs server with se3510 with Periodic Cache Flush Time set to disable or so longer time I can see that cache utilization on nfs/ufs stays about 48% while on nfs/zfs it's hardly reaches 20% and every few seconds goes down to 0 (I guess every txg_time). nfs/zfs also has worse performance than nfs/ufs. Does anybody know how to tell se3510 not to honor write cache flush commands? JE I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array JE to flush the cache. Gauging from the amount of calls that zfs makes to JE this vs ufs (fsck, lockfs, mount?) - i think you'll see the JE performance diff, JE particularly when you hit an NFS COMMIT. (If you don't use vdevs you JE may see another difference in zfs as the only place you'll hit is on JE the zil) IMHO it definitely shouldn't actually. The array has two controllers and write cache is mirrored. Also this is not the only host using that array. You can actually win much of a performance, especially with nfs/zfs setup (lot of synchronous ops) I guess. Again it's a question of semantic. The intent of ZFS is to say put the bits on stable storage. The controller then decides if the NVRAM qualifies as stable storage (is dual ported, batteries are up) and can ignore the request. If the battery charge gets low then the array needs to honor the flush request. No way for ZFS to adjust to array battery charge. So I'd argue the DKIOCFLUSHWRITECACHE is misnamed. The work going on is to allow the DKIOCFLUSHWRITECACHE to be qualified to mean flush to rust which I guess won't be used by ZFS or flush to stable storage; if NVRAM is considered stable enough by the array, then it will turn the request into a noop. -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
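For the record, that qualification work later showed up as a host-side tunable in newer builds; on a build that has it, zfs_nocacheflush makes ZFS stop issuing the flush requests entirely, which is only safe when every device behind every pool has non-volatile cache:

set zfs:zfs_nocacheflush = 1      (in /etc/system; newer builds only)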
Re: [zfs-discuss] Actual (cache) memory use of ZFS?
Bjorn Munch writes: Hello, I am doing some tests using ZFS for the data files of a database system, and ran into memory problems which has been discussed in a thread here a few weeks ago. When creating a new database, the data files are first initialized to their configured size (written in full), then the servers are started. They will then need to allocate shared memory for database cache. I am running two database nodes per host, trying to use 512Mb memory each. They are using so-called Intimate Shared Memory which requires that the requested amount is available in physical memory. Since ZFS has just gobbled up memory for cache, it is not available and the database won't start. This was on a host with 2Gb memory. That seems like a bug. ZFS is designed to release memory upon demand by the DB. Which OS was this running ? Could be related to : MrNumber: 4034947 Synopsis: anon_swap_adjust(), anon_resvmem() should call kmem_reap() if availrmem is low. Fixed in snv_42 -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thumper Origins Q
Nicolas Williams writes: On Tue, Jan 30, 2007 at 06:32:14PM +0100, Roch - PAE wrote: The only benefit of using a HW RAID controller with ZFS is that it reduces the I/O that the host needs to do, but the trade off is that ZFS cannot do combinatorial parity reconstruction so that it could only detect errors, not correct them. It would be cool if the host could offload the RAID I/O to a HW controller but still be able to read the individual stripes to perform combinatorial parity reconstruction. right but in this situation, if the cosmic ray / bit flip hits on the way to the controller, the array will store wrong data and we will not be able to reconstruct the correct block. So having multiple I/Os here improves the time to data loss metric. You missed my point. Assume _new_ RAID HW that allows the host to read the individual stripes. The ZFS could offload I/O to the RAID HW but, when a checksum fails to validate on read, THEN go read the individual stripes and parity and do the combinatorial reconstruction as if the RAID HW didn't exist. I don't believe such RAID HW exists, therefore the point is moot. But if such HW ever comes along... Nico -- I think I got the point. Mine was that if the data travels a single time toward the storage and is corrupted along the way then there will be no hope of recovering it since the array was given bad data. Having the data travel twice is a benefit for MTTDL. -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: ZFS or UFS - what to do?
Anantha N. Srirama writes: Agreed, I guess I didn't articulate my point/thought very well. The best config is to present JBoDs and let ZFS provide the data protection. This has been a very stimulating conversation thread; it is shedding new light on how to best use ZFS.

I would say: To enable the unique ZFS feature of self-healing, ZFS must be allowed to manage a level of redundancy: mirroring or RAID-Z. The type of LUNs (JBOD/RAID-*/iSCSI) used is not relevant in this statement. Now, if one also relies on ZFS to reconstruct data in the face of disk failures (as opposed to storage-based reconstruction), better make sure that single/double disk failures do not bring down multiple LUNs at once. So better protection is achieved by configuring LUNs that map to segregated sets of physical things (disks, controllers). -r

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS direct IO
[EMAIL PROTECTED] writes: Note also that for most applications, the size of their IO operations would often not match the current page size of the buffer, causing additional performance and scalability issues. Thanks for mentioning this, I forgot about it. Since ZFS's default block size is configured to be larger than a page, the application would have to issue page-aligned block-sized I/Os. Anyone adjusting the block size would presumably be responsible for ensuring that the new size is a multiple of the page size. (If they would want Direct I/O to work...) I believe UFS also has a similar requirement, but I've been wrong before. I believe the UFS requirement is that the I/O be sector aligned for DIO to be attempted. And Anton did mention that one of the benefit of DIO is the ability to direct-read a subpage block. Without UFS/DIO the OS is required to read and cache the full page and the extra amount of I/O may lead to data channel saturation (I don't see latency as an issue in here, right ?). This is where I said that such a feature would translate for ZFS into the ability to read parts of a filesystem block which would only make sense if checksums are disabled. And for RAID-Z that could mean avoiding I/Os to each disks but one in a group, so that's a nice benefit. So for the performance minded customer that can't afford mirroring, is not much a fan of data integrity, that needs to do subblock reads to an uncacheable workload, then I can see a feature popping up. And this feature is independant on whether or not the data is DMA'ed straight into the user buffer. The other feature, is to avoid a bcopy by DMAing full filesystem block reads straight into user buffer (and verify checksum after). The I/O is high latency, bcopy adds a small amount. The kernel memory can be freed/reuse straight after the user read completes. This is where I ask, how much CPU is lost to the bcopy in workloads that benefit from DIO ? At this point, there are lots of projects that will lead to performance improvements. The DIO benefits seems like small change in the context of ZFS. The quickest return on investement I see for the directio hint would be to tell ZFS to not grow the ARC when servicing such requests. -r -j ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Heavy writes freezing system
If some aspect of the load is writing large amounts of data into the pool (through the memory cache, as opposed to the ZIL) and that leads to a frozen system, I think that a possible contributor should be:

6429205 each zpool needs to monitor its throughput and throttle heavy writers

-r

Anantha N. Srirama writes: Bug 6413510 is the root cause. ZFS maestros please correct me if I'm quoting an incorrect bug.

This message posted from opensolaris.org
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
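[One illustrative way to check whether heavy asynchronous writes are saturating the pool while the system stalls, using standard Solaris observability commands; pool0 here is just a stand-in for the affected pool:

target_host:~ # zpool iostat pool0 5     (pool-wide bandwidth and ops, sampled every 5 seconds)
target_host:~ # fsstat zfs 5             (per-filesystem-type operation and byte rates)
]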
Re: [zfs-discuss] Re: Heavy writes freezing system
Jason J. W. Williams writes: Hi Anantha, I was curious why segregating at the FS level would provide adequate I/O isolation? Since all FS are on the same pool, I assumed flogging a FS would flog the pool and negatively affect all the other FS on that pool? Best Regards, Jason

Good point. If the problem is

6413510 zfs: writing to ZFS filesystem slows down fsync() on other files

then segregating into 2 filesystems on the same pool will help. But if the problem is more like

6429205 each zpool needs to monitor its throughput and throttle heavy writers

then 2 FS won't help. 2 pools probably would, though.

-r

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
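[A minimal sketch of the two approaches, using hypothetical pool, dataset, and device names; the first only isolates fsync() contention within a pool (6413510), while the second also isolates pool-wide write throttling (6429205):

target_host:~ # zfs create pool0/db          (two filesystems, one pool)
target_host:~ # zfs create pool0/logs

target_host:~ # zpool create dbpool mirror c2t0d0 c3t0d0      (or: two separate pools
target_host:~ # zpool create logpool mirror c2t1d0 c3t1d0      on separate devices)
]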
Re: [zfs-discuss] Re: ZFS direct IO
Jonathan Edwards writes: On Jan 5, 2007, at 11:10, Anton B. Rang wrote: DIRECT IO is a set of performance optimisations to circumvent shortcomings of a given filesystem. Direct I/O as generally understood (i.e. not UFS-specific) is an optimization which allows data to be transferred directly between user data buffers and disk, without a memory-to-memory copy. This isn't related to a particular file system. true .. directio(3C) is generally used in the context of *any* given filesystem to advise it that an application buffer to system buffer copy may get in the way or add additional overhead (particularly if the filesystem buffer is doing additional copies). You can also look at it as a way of reducing more layers of indirection, particularly if I want the application overhead to be higher than the subsystem overhead. Programmatically .. less is more.

Direct I/O makes good sense when the target disk sectors are set a priori. But in the context of ZFS, would you rather have 10 direct disk I/Os, or 10 bcopies and 2 I/Os (say that was possible)? As for reads, I can see that when the load is cached in the disk array and we're running at 100% CPU, the extra copy might be noticeable. Is this the situation that longs for DIO? What % of a system is spent in the copy? What is the added latency that comes from the copy? Is DIO the best way to reduce the CPU cost of ZFS? The current Nevada code base has quite nice performance characteristics (and certainly quirks); there are many further efficiency gains to be reaped from ZFS. I just don't see DIO at the top of that list for now. Or at least someone needs to spell out what ZFS/DIO is and how much better it is expected to be (back-of-the-envelope calculation accepted).

Reading RAID-Z subblocks on filesystems that have checksums disabled might be interesting. That would avoid some disk seeks. Whether to serve the subblocks directly or not is a separate matter; it's a small deal compared to the feature itself. How about disabling the DB checksum (it can't fix the block anyway) and doing mirroring?

-r

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
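[A minimal sketch of the ZFS side of that last suggestion, with hypothetical pool and dataset names: keep the pool mirrored and leave the ZFS checksum at its default, so ZFS can both detect and repair a bad block, which an application-level checksum alone cannot do:

target_host:~ # zpool create dbpool mirror c2t0d0 c3t0d0
target_host:~ # zfs create dbpool/oradata
target_host:~ # zfs get checksum dbpool/oradata      (remains "on" by default; ZFS self-heals from the mirror)
]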
Re: [zfs-discuss] NFS and ZFS, a fine combination
Dennis Clarke writes: On Mon, Jan 08, 2007 at 03:47:31PM +0100, Peter Schuller wrote: http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine So just to confirm; disabling the ZIL *ONLY* breaks the semantics of fsync() and synchronous writes from the application perspective; it will do *NOTHING* to lessen the correctness guarantee of ZFS itself, including in the case of a power outage? That is correct. ZFS, with or without the ZIL, will *always* maintain consistent on-disk state and will *always* preserve the ordering of events on-disk. That is, if an application makes two changes to the filesystem, first A, then B, ZFS will *never* show B on-disk without also showing A. So then, this begs the question: why do I want this ZIL animal at all? You said "correctness guarantee", Bill said "...consistent on-disk state".

The ZIL is not necessary for ZFS to keep its on-disk format consistent. However, the ZIL is necessary/essential to provide synchronous semantics to applications. Without a ZIL, fsync() and the like become a no-op; it's a very uncommon requirement, although one that does exist. But for ZFS to be a correct filesystem, the ZIL is necessary and provides an excellent service. My article shows that ZFS with the ZIL can be better than UFS (which uses its own logging scheme).

-r

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
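[For reference, a sketch of how the ZIL was commonly disabled on Nevada builds of that vintage; the zil_disable tunable assumed here only changes synchronous (fsync/O_DSYNC) semantics, not on-disk consistency, and it only takes effect for filesystems mounted after the change:

target_host:~ # echo "set zfs:zil_disable = 1" >> /etc/system     (persistent; needs a reboot)
target_host:~ # echo zil_disable/W0t1 | mdb -kw                   (live; applies to subsequently mounted filesystems)
]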