Re: [nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Robert Milkowski wrote: Hello eric, Saturday, December 9, 2006, 7:07:49 PM, you wrote:

ek Jim Mauro wrote: Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper? It's not clear why NFS-enforced synchronous semantics would induce different behavior than the same load to a local ZFS.

ek Actually i forgot he had 'zil_disable' turned on, so it won't matter in
ek this case.

Ben, are you sure zil_disable was set to 1 BEFORE the pool was imported?

Yes, absolutely. Set the var in /etc/system, reboot, system comes up. That happened almost 2 months ago, long before this lock insanity problem popped up.

To be clear, the ZIL issue was a problem for creation of a handful of files of any size. Untar'ing a file was a massive performance drain. This issue, on the other hand, deals with thousands of little files being created all the time (IMAP locks). These are separate issues from my point of view. With the ZIL slowness, NFS performance was just slow but we didn't see massive CPU usage; with this issue, on the other hand, we were seeing waves in 10-second-ish cycles where the run queue would go sky high with 0 idle. Please see the earlier mails for examples of the symptoms.

benr.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
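For readers following along: the /etc/system tuning Ben describes would look roughly like the fragment below. This is a sketch, not a recommendation — zil_disable was a private, undocumented tunable of the zfs module in Nevada builds of that era, its exact name and behavior varied by build, and turning it off sacrifices the synchronous-write guarantees NFS clients depend on.

```
* /etc/system fragment (assumption: snv-era private tunable; verify on your build)
* Disable the ZFS Intent Log. Unsafe for data integrity over NFS;
* only useful for diagnosis. Takes effect on the next boot (the pool
* must be imported AFTER the variable is set, as Robert notes above).
set zfs:zil_disable = 1
```

As the thread discusses, the setting only matters if it is in place before the pool is imported, which is why Robert asks about the ordering.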
Re[2]: [nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hello Ben, Monday, December 11, 2006, 9:34:18 PM, you wrote:

BR Robert Milkowski wrote: Hello eric, Saturday, December 9, 2006, 7:07:49 PM, you wrote: ek Jim Mauro wrote: Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper? It's not clear why NFS-enforced synchronous semantics would induce different behavior than the same load to a local ZFS. ek Actually i forgot he had 'zil_disable' turned on, so it won't matter in ek this case. Ben, are you sure zil_disable was set to 1 BEFORE the pool was imported?

BR Yes, absolutely. Set the var in /etc/system, reboot, system comes up. That
BR happened almost 2 months ago, long before this lock insanity problem
BR popped up.

BR To be clear, the ZIL issue was a problem for creation of a handful of
BR files of any size. Untar'ing a file was a massive performance drain.
BR This issue, on the other hand, deals with thousands of little files
BR being created all the time (IMAP locks). These are separate issues from
BR my point of view. With the ZIL slowness NFS performance was just slow but
BR we didn't see massive CPU usage; with this issue, on the other hand, we
BR were seeing waves in 10-second-ish cycles where the run queue would go
BR sky high with 0 idle. Please see the earlier mails for examples of the
BR symptoms.

Ok. And just another question - is the nfs file system mounted with the noac option on the imap server, and is the application doing chdir() to nfs directories? I'm not sure if it's fixed - if not, your nfs client will generate a lot of small traffic to the nfs server on every chdir(), starving it of CPU.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com
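Robert's noac scenario can be checked quickly on the IMAP client. A sketch (exact output formats differ across Solaris releases, and the interface name below is hypothetical):

```
# On the NFS client: list per-mount options and look for "noac",
# or for actimeo=0, which has the same effect.
nfsstat -m /var/mail

# If noac is set, every chdir()/stat() bypasses the attribute cache;
# the resulting GETATTR storm is visible on the wire, e.g.:
snoop -d e1000g0 rpc nfs
```

With noac, each small metadata operation becomes an over-the-wire round trip, which matches the "lots of tiny requests, little bandwidth" symptom described earlier in the thread.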
Re: [nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
BR Yes, absolutely. Set the var in /etc/system, reboot, system comes up. That
BR happened almost 2 months ago, long before this lock insanity problem
BR popped up.

For the archives, a high level of lock activity can always be a problem. The worst cases I've experienced were with record locking over NFS. The worst offenders were programs written with a local file system in mind, especially PC-based applications. This has become one of the first things I check in such instances; the network traffic will consistently be full of lock activity. Just one of those things you need to watch out for in a networked world :-)

-- richard
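One quick way to confirm Richard's point on the wire: for NFSv3, file locking travels over the separate Network Lock Manager (NLM) protocol, so lock storms show up as a distinct RPC program. A sketch (the interface name is hypothetical; snoop's rpc filter accepts program names from /etc/rpc):

```
# Watch NFSv3 lock-manager (NLM, RPC program 100021) traffic on the server
snoop -d e1000g0 rpc nlockmgr

# NFSv4 builds locking into the protocol itself, so there you would
# instead look for LOCK/LOCKT/LOCKU operations in the NFS stream.
```

If the capture is dominated by NLM calls relative to actual READ/WRITE traffic, a lock-happy application written with a local filesystem in mind is the likely culprit.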
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
On Fri, Ben Rockwood wrote:

eric kustarz wrote: So i'm guessing there's lots of files being created over NFS in one particular dataset? We should figure out how many creates/second you are doing over NFS (i should have put a timeout on the script). Here's a real simple one (from your snoop it looked like you're only doing NFSv3, so i'm not tracking NFSv4):

#!/usr/sbin/dtrace -s

rfs3_create:entry, zfs_create:entry
{
        @creates[probefunc] = count();
}

tick-60s
{
        exit(0);
}

Eric, I love you. Running this bit of DTrace revealed more than 4,000 files being created in almost any given 60 second window. And I've only got one system that would fit that sort of mass file creation: our Joyent Connector product's Courier IMAP server, which uses Maildir. As a test I simply shut down Courier and unmounted the mail NFS share for good measure, and sure enough the problem vanished and could not be reproduced. 10 minutes later I re-enabled Courier and our problem came back.

Clearly ZFS file creation is just amazingly heavy even with the ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores, we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP mail so that the other things served from that NFS server at least aren't affected.

You asked for the zpool and zfs info, which I don't want to share because it's confidential (if you want it privately I'll do so, but not on a public list), but I will say that it's a single massive zpool in which we're using less than 2% of the capacity. But in thinking about this problem, even if we used 2 or more pools, the CPU consumption still would have choked the system, right? This leaves me really nervous about what we'll do when it's not an internal mail server that's creating all those files but a customer. Oddly enough, this might be a very good reason to use iSCSI instead of NFS on the Thumper.

Eric, I owe you a couple cases of beer for sure. I can't tell you how much I appreciate your help.
Thanks to everyone else who chimed in with ideas and suggestions, all of you guys are the best!

Good to hear that you have figured out what is happening, Ben. For future reference, there are two commands that you may want to make use of in observing the behavior of the NFS server and individual filesystems. There is the trusty nfsstat command. In this case, you would have been able to do something like:

nfsstat -s -v3 60

This will provide all of the server side NFSv3 statistics on 60 second intervals. Then there is a new command, fsstat, that will provide vnode level activity on a per filesystem basis. Therefore, if the NFS server has multiple filesystems active and you want to look at just one, something like this can be helpful:

fsstat /export/foo 60

fsstat has a 'full' option that will list all of the vnode operations, or just certain types. It also will watch a filesystem type (e.g. zfs, nfs). Very useful.

Spencer
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote:

Clearly ZFS file creation is just amazingly heavy even with ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP Mail so that the other things served from that NFS server at least aren't affected.

For local tests, this is not true of ZFS. It seems that file creation only swamps us when coming over NFS. We can do thousands of files a second on a Thumper with room to spare if NFS isn't involved. Next step is to figure out why NFS kills us.

--Bill
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Spencer Shepler wrote: Good to hear that you have figured out what is happening, Ben. For future reference, there are two commands that you may want to make use of in observing the behavior of the NFS server and individual filesystems. There is the trusty nfsstat command. In this case, you would have been able to do something like: nfsstat -s -v3 60. This will provide all of the server side NFSv3 statistics on 60 second intervals. Then there is a new command, fsstat, that will provide vnode level activity on a per filesystem basis. Therefore, if the NFS server has multiple filesystems active and you want to look at just one, something like this can be helpful: fsstat /export/foo 60. fsstat has a 'full' option that will list all of the vnode operations, or just certain types. It also will watch a filesystem type (e.g. zfs, nfs). Very useful.

nfsstat I've been using, but fsstat I was unaware of. Wish I'd used it rather than duplicating most of its functionality with a D script. :) Thanks for the tip.

benr.
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Bill Moore wrote: On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote: Clearly ZFS file creation is just amazingly heavy even with ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP Mail so that the other things served from that NFS server at least aren't affected. For local tests, this is not true of ZFS. It seems that file creation only swamps us when coming over NFS. We can do thousands of files a second on a Thumper with room to spare if NFS isn't involved. Next step is to figure out why NFS kills us.

Agreed. If mass file creation was a problem locally, I'd think we'd have people beating down the doors with complaints. One thought I had as a workaround was to move all my mail on NFS to an iSCSI LUN and then put a zpool on that. I'm willing to bet that'd work fine. Hopefully I can try it.

To round out the discussion, the root cause of this whole mess was Courier IMAP locking. After isolating the problem last night and writing a little D script to find out what files were being created, it was obviously lock files; turning off locking dropped file creations to a reasonable level and our problem vanished.

If I can help at all with testing or analysis please let me know.

benr.
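Ben's iSCSI-LUN workaround would look roughly like the sketch below. Everything here is hypothetical illustration: the volume and pool names are invented, the target IQN is elided, and the shareiscsi property only appeared in later Nevada builds (on a build as old as snv_43 a separate iSCSI target daemon would be needed instead).

```
# On the Thumper: carve a zvol out of the existing pool and export it
# (names hypothetical; 'joyous' is the pool named earlier in the thread)
zfs create -V 100g joyous/mailvol
zfs set shareiscsi=on joyous/mailvol        # later-build property; verify

# On the mail server: discover the target and build a LOCAL zpool on it
iscsiadm modify discovery --sendtargets enable
iscsiadm add discovery-address <thumper-ip>
zpool create mailpool c2t<iqn...>d0
```

The point of the exercise: with a local zpool on the iSCSI LUN, file creates are ordinary local operations and the NFS per-create VOP_FSYNC discussed later in this thread never happens, at the cost of losing the shared-filesystem semantics of NFS.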
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Ben Rockwood wrote:

Bill Moore wrote: On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote: Clearly ZFS file creation is just amazingly heavy even with ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP Mail so that the other things served from that NFS server at least aren't affected. For local tests, this is not true of ZFS. It seems that file creation only swamps us when coming over NFS. We can do thousands of files a second on a Thumper with room to spare if NFS isn't involved. Next step is to figure out why NFS kills us.

Agreed. If mass file creation was a problem locally, I'd think we'd have people beating down the doors with complaints. One thought I had as a workaround was to move all my mail on NFS to an iSCSI LUN and then put a zpool on that. I'm willing to bet that'd work fine. Hopefully I can try it.

Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper?

To round out the discussion, the root cause of this whole mess was Courier IMAP locking. After isolating the problem last night and writing a little D script to find out what files were being created, it was obviously lock files; turning off locking dropped file creations to a reasonable level and our problem vanished.

Gotta love that D.

If I can help at all with testing or analysis please let me know.

I want to create a test case in-house first and see if it's reproducible.

eric

ps: i like lots of different types of beers
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper?

It's not clear why NFS-enforced synchronous semantics would induce different behavior than the same load to a local ZFS. File creates are metadata intensive, right? And these operations need to be synchronous to guarantee file system consistency (yes, I am familiar with the ZFS COW model).

Anyway, I'm feeling rather naive here, but I've seen the "NFS enforced synchronous semantics" phrase kicked around many times as the explanation for suboptimal performance for metadata-intensive operations when ZFS is the underlying file system, but I never really understood what's unsynchronous about doing the same thing to a local ZFS. And yes, there is certainly a network latency component to the NFS configuration, so for any synchronous operation, I would expect things to be slower when done over NFS.

Awaiting enlightenment :^)

/jim
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Jim Mauro wrote: Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper? It's not clear why NFS-enforced synchronous semantics would induce different behavior than the same load to a local ZFS.

Actually i forgot he had 'zil_disable' turned on, so it won't matter in this case. The big difference is that the NFS server does a VOP_FSYNC on each file create, as it needs to guarantee that the changes are on stable storage. Locally, we don't need that same guarantee, so you don't have to wait for a disk I/O to complete on each create.

eric

File creates are metadata intensive, right? And these operations need to be synchronous to guarantee file system consistency (yes, I am familiar with the ZFS COW model). Anyway, I'm feeling rather naive here, but I've seen the "NFS enforced synchronous semantics" phrase kicked around many times as the explanation for suboptimal performance for metadata-intensive operations when ZFS is the underlying file system, but I never really understood what's unsynchronous about doing the same thing to a local ZFS. And yes, there is certainly a network latency component to the NFS configuration, so for any synchronous operation, I would expect things to be slower when done over NFS. Awaiting enlightenment :^) /jim
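eric's point can be measured directly on the server: time how much of each NFSv3 create is spent waiting in the fsync path. Below is a sketch only; it assumes the rfs3_create and zfs_fsync entry points seen elsewhere in this thread are fbt-instrumentable on a build of that era, which may not hold exactly on every release.

```
#!/usr/sbin/dtrace -s
/* Sketch: distribution of time spent in zfs_fsync() when it is
 * called underneath an NFSv3 file create (rfs3_create). */
fbt::rfs3_create:entry  { self->in_create = 1; }
fbt::zfs_fsync:entry    /self->in_create/ { self->ts = timestamp; }
fbt::zfs_fsync:return
/self->ts/
{
        @fsync_ns["ns in zfs_fsync per NFS create"] =
            quantize(timestamp - self->ts);
        self->ts = 0;
}
fbt::rfs3_create:return { self->in_create = 0; }
tick-60s                { exit(0); }
```

A local create path never enters this probe pair, which is the asymmetry Jim was asking about: same zfs_create work, but the NFS server adds a stable-storage wait on every call.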
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
On Dec 9, 2006, at 8:59, Jim Mauro wrote: Anyway, I'm feeling rather naive here, but I've seen the "NFS enforced synchronous semantics" phrase kicked around many times as the explanation for suboptimal performance for metadata-intensive operations when ZFS is the underlying file system, but I never really understood what's unsynchronous about doing the same thing to a local ZFS.

If I remember correctly, the difference is that NFS requires that the operation be committed to stable storage before the return to the client. This is definitely a heavier operation than the local case, where the return to the caller may happen as soon as the operation is cached. If there's a crash, the local system does not guarantee to the caller that the operation is on disk, but NFS does. Both guarantee consistency, but NFS makes stronger guarantees of completeness.

--Ed
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
On Dec 9, 2006, at 8:59, Jim Mauro wrote: Anyway, I'm feeling rather naive here, but I've seen the "NFS enforced synchronous semantics" phrase kicked around many times as the explanation for suboptimal performance for metadata-intensive operations when ZFS is the underlying file system, but I never really understood what's unsynchronous about doing the same thing to a local ZFS.

If I remember correctly, the difference is that NFS requires that the operation be committed to stable storage before the return to the client. This is definitely a heavier operation than the local case, where the return to the caller may happen as soon as the operation is cached. If there's a crash, the local system does not guarantee to the caller that the operation is on disk, but NFS does. Both guarantee consistency but NFS makes stronger guarantees of completeness.

The current way to solve this from the NFS server's perspective is to call something equivalent to sync(). It was my understanding that disabling the ZIL would make such operations more or less asynchronous, but it still seems that NFS is doing a sync()-like thing.

Perhaps there should be a different mechanism for making VOP_* calls: calls which do not return until the specific call is committed to stable storage (interested parties get notified when such events happen). Unfortunately, that would cause the NFS client system calls to sleep until the calls are completed (as they do now), but there doesn't seem to be a need for that; writes currently need a specific commit, and it would perhaps be worthwhile to investigate more asynchronous behaviour. As long as the client remembers the non-committed calls, there is no data loss issue; when the client crashes, you expect data loss; when the server crashes, the client resends.

Casper
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
eric kustarz wrote: So i'm guessing there's lots of files being created over NFS in one particular dataset? We should figure out how many creates/second you are doing over NFS (i should have put a timeout on the script). Here's a real simple one (from your snoop it looked like you're only doing NFSv3, so i'm not tracking NFSv4):

#!/usr/sbin/dtrace -s

rfs3_create:entry, zfs_create:entry
{
        @creates[probefunc] = count();
}

tick-60s
{
        exit(0);
}

Eric, I love you. Running this bit of DTrace revealed more than 4,000 files being created in almost any given 60 second window. And I've only got one system that would fit that sort of mass file creation: our Joyent Connector product's Courier IMAP server, which uses Maildir. As a test I simply shut down Courier and unmounted the mail NFS share for good measure, and sure enough the problem vanished and could not be reproduced. 10 minutes later I re-enabled Courier and our problem came back.

Clearly ZFS file creation is just amazingly heavy even with the ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores, we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP mail so that the other things served from that NFS server at least aren't affected.

You asked for the zpool and zfs info, which I don't want to share because it's confidential (if you want it privately I'll do so, but not on a public list), but I will say that it's a single massive zpool in which we're using less than 2% of the capacity. But in thinking about this problem, even if we used 2 or more pools, the CPU consumption still would have choked the system, right? This leaves me really nervous about what we'll do when it's not an internal mail server that's creating all those files but a customer. Oddly enough, this might be a very good reason to use iSCSI instead of NFS on the Thumper.

Eric, I owe you a couple cases of beer for sure. I can't tell you how much I appreciate your help.
Thanks to everyone else who chimed in with ideas and suggestions, all of you guys are the best!

benr.
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Hey Ben - I need more time to look at this and connect some dots, but real quick: some nfsstat data that we could use to potentially correlate to the local server activity would be interesting. zfs_create() seems to be the heavy hitter, but a periodic kernel profile (especially if we can catch a 97% SYS period) would help:

# lockstat -i997 -Ik -s 10 sleep 60

Alternatively:

# dtrace -n 'profile-997hz / arg0 != 0 / { @s[stack()]=count(); }'

It would also be interesting to see what the zfs_create()'s are doing. Perhaps a quick:

# dtrace -n 'zfs_create:entry { printf("ZFS Create: %s\n", stringof(args[0]->v_path)); }'

It would also be interesting to see the network stats. Grab Brendan's nicstat and collect some samples. Your reference to low traffic is in bandwidth, which, as you indicate, is really, really low. But the data, at least up to this point, suggests the workload is not data/bandwidth intensive, but more attribute intensive. Note again zfs_create() is the heavy ZFS function, along with zfs_getattr. Perhaps it's the attribute-intensive nature of the load that is at the root of this.

I can spend more time on this tomorrow (traveling today). Thanks, /jim

Ben Rockwood wrote: I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know.
Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP, shared via ZFS sharenfs, on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome!

benr.

===

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed, dns was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF.
Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves:

[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc 12/07/2006

          %usr  %sys  %wio  %idle
10:38:06     0    27     0     73
10:38:07     0    27     0     73
10:38:09     0    27     0     73
10:38:10     1    26     0     73
10:38:11     0    26     0     74
10:38:12     0    26     0     74
10:38:13     0    24     0     76
10:38:14     0     6     0     94
10:38:15     0     7     0     93
10:38:22     0    99     0      1  --
10:38:23     0    94     0      6  --
10:38:24     0    28     0     72
10:38:25     0    27     0     73
10:38:26     0    27     0     73
10:38:27     0    27     0     73
10:38:28     0    27     0     73
10:38:29     1    30     0     69
10:38:30     0    27     0     73

And so we consider whether or not there is a pattern to the frequency. The following is sar output from any lines in which sys is above 90%:

          %usr  %sys  %wio  %idle  Delta
10:40:11     0    97     0      3
10:40:45     0    98     0      2   34 seconds
10:41:02     0    94     0      6   17 seconds
10:41:26     0   100     0      0   24 seconds
10:42:00     0   100     0      0   34 seconds
10:42:25  (end of sample)           25 seconds

Looking at the
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Ben, The attached D script might help determine the zfs_create issue. It prints:
- a count of all functions called from zfs_create
- average wall clock time of the 30 highest functions
- average CPU time of the 30 highest functions

Note, please ignore warnings of the following type:

dtrace: 1346 dynamic variable drops with non-empty dirty list

Neil.

Ben Rockwood wrote On 12/07/06 06:01,: I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP, shared via ZFS sharenfs, on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome!

benr.
===

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed, dns was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF.

Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves:

[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc 12/07/2006

          %usr  %sys  %wio  %idle
10:38:06     0    27     0     73
10:38:07     0    27     0     73
10:38:09     0    27     0     73
10:38:10     1    26     0     73
10:38:11     0    26     0     74
10:38:12     0    26     0     74
10:38:13     0    24     0     76
10:38:14     0     6     0     94
10:38:15     0     7     0     93
10:38:22     0    99     0      1  --
10:38:23     0    94     0      6  --
10:38:24     0    28     0     72
10:38:25     0    27     0     73
10:38:26     0    27     0     73
10:38:27     0    27     0     73
10:38:28     0    27     0     73
10:38:29     1    30     0     69
10:38:30     0    27     0     73

And so we consider whether or not there is a pattern to the frequency. The following is sar output from any lines in which sys is above 90%:

          %usr  %sys  %wio  %idle  Delta
10:40:11     0    97     0      3
10:40:45     0    98     0      2   34 seconds
10:41:02     0    94     0      6   17 seconds
10:41:26     0   100     0      0   24 seconds
10:42:00     0   100     0      0   34 seconds
10:42:25  (end of sample)           25 seconds

Looking at the congestion in the run queue:

[private:/tmp] root# sar -q 5 100
10:45:43  runq-sz  %runocc  swpq-sz  %swpocc
10:45:51     27.0       85      0.0        0
10:45:57      1.0       20      0.0        0
10:46:02      2.0       60      0.0        0
10:46:13     19.8       99      0.0        0
10:46:23     17.7       99      0.0        0
10:46:34     24.4       99      0.0        0
10:46:41     22.1       97      0.0        0
10:46:48     13.0       96      0.0        0
10:46:55     25.3      102      0.0        0

Looking at the per-CPU breakdown:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
00 00 324 224000 1540 00 100 0 0
10 00 1140 2260 10 130860 1 0 99
20 00 162 138 1490540 00
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Ben Rockwood wrote: Eric Kustarz wrote: Ben Rockwood wrote: I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

We made several performance fixes in the NFS/ZFS area in recent builds, so if possible it would be great to upgrade you from snv_43. That said, there might be something else going on that we haven't accounted for.

... Step 1 was to disable any ZFS features that might consume large amounts of CPU: # zfs set compression=off joyous # zfs set atime=off joyous # zfs set checksum=off joyous

In our performance testing, we haven't found checksums to be anywhere near a large consumer of CPU, so i would recommend leaving that on (due to its benefits). I suspect your apps/clients don't depend on atime, so i think its a good idea to turn that off. We've gotten better NFS performance with this off. More of a heads up, as it sounds like compression on/off isn't your problem. If you are not getting good I/O BW with compression turned on, it's most likely due to:

6460622 zio_nowait() doesn't live up to its name

As Jim mentioned, using lockstat to figure out where your CPU is being spent is the first step. I've been using 'lockstat -kgIW -D 60 sleep 60'. That collects data for the top 60 callers for a 1 minute period. If you see 'mutex_enter' high up in the results, then we have at least mutex lock contention. ...
Interestingly, using prstat -mL to monitor thread latency, we see that a handful of threads are the culprits for consuming mass CPU:

[private:/tmp] root# prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 416   1   0   0 nfsd/1506
 22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 415   0   0   0 nfsd/1563
 22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 417   0   0   0 nfsd/1554
 22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 419   0   0   0 nfsd/1551
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0  26  74 418   0   0   0 nfsd/1553
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1536
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1555
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1539
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1562
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1545
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1559
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419   1   0   0 nfsd/1541
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1546
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1543
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1560
Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86

The high SYS times being charged to the userland nfsd threads is representative of what the kernel threads are doing (which is most likely going to be NFS and ZFS).
Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an idea of what ZFS is doing:

[private:/] root# dtrace -s zvop_times.d
dtrace: script 'zvop_times.d' matched 66 probes
^C
CPU     ID            FUNCTION:NAME
  1      2                     :END

ZFS COUNT
zfs_getsecattr        4
zfs_space            70
zfs_rename          111
zfs_readdir         284
zfs_read            367
zfs_mkdir           670
zfs_setattr        1054
zfs_frlock         1562
zfs_putpage        3916
zfs_write          4110
zfs_create         4178
zfs_remove         7794
zfs_fid           14960
zfs_inactive      17637
zfs_access        20809
zfs_fsync         29668
zfs_lookup        31273
zfs_getattr      175457

ZFS AVG TIME
zfs_fsync          2337
zfs_getattr        2735
zfs_access         2774
zfs_fid            2948
zfs_inactive