Re: NFS server bottlenecks
Ivan Voras wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. Oh, I realized that, if you are testing 9/stable (and not head), that you won't have r227809. Without that, all reads on a given file will be serialized, because the server will acquire an exclusive lock on the vnode. The patch for r227809 in head is at: http://people.freebsd.org/~rmacklem/lkshared.patch This should apply fine to a 9 system (but not 8.n), I think. 
Good luck with it and have fun, rick ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: NFS server bottlenecks
On Oct 23, 2012, at 2:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. Oh, I realized that, if you are testing 9/stable (and not head), that you won't have r227809. Without that, all reads on a given file will be serialized, because the server will acquire an exclusive lock on the vnode. The patch for r227809 in head is at: http://people.freebsd.org/~rmacklem/lkshared.patch This should apply fine to a 9 system (but not 8.n), I think. 
Good luck with it and have fun, rick

Thanks, I've applied the patch by hand because of some differences and I'm now rebuilding. In case they are still needed, here are the dd tests with a loopback UDP mount: http://home.totalterror.net/freebsd/nfstest/udp-dd.html Over UDP, writing degrades much more severely...
Re: NFS server bottlenecks
On Oct 18, 2012, at 6:11 PM, Nikolay Denev nde...@gmail.com wrote: On Oct 15, 2012, at 5:34 PM, Ivan Voras ivo...@freebsd.org wrote: On 15 October 2012 16:31, Nikolay Denev nde...@gmail.com wrote: On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. Applied and compiled OK, I will be able to test it tomorrow. Ok, thanks! The differences should be most visible in edge cases with a larger number of nfsd processes (16+) and many CPU cores. I'm now rebooting with your patch, and hopefully will have some results tomorrow. Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch.
Re: NFS server bottlenecks
On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few other things I'm interested in: why does your load average spike almost to the 20s, and how come that with 24 drives in RAID-10 you only push 600 MBit/s through the 10 GBit/s Ethernet? Have you tested your drive setup locally (AESNI shouldn't be a bottleneck; you should be able to encrypt well into the GByte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious how it would perform on your machine.
Re: NFS server bottlenecks
Ivan Voras wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. As I think you know, my concern with your patch would be correctness for UDP, not performance.) Anyhow, sounds like you guys are having fun with it and learning some useful things. Keep up the good work, rick But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. 
Re: NFS server bottlenecks
On Oct 20, 2012, at 3:11 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine.

I've now started this test locally. From previous (different) iozone runs I remember the speed was much better locally, but I will wait for this test to finish, as the comparison will be better. But I think there is still something fishy… I have cases where I have reached 1000 MB/s over NFS (from network stats, not local machine stats), but sometimes it is very slow even for a file that is completely in the ARC.
Rick mentioned that this could be due to RPC overhead and network round trip time, but earlier in this thread I had done a test only on the server, by mounting the NFS-exported ZFS dataset locally and running some tests with dd. To take the network out of the equation I redid the test by mounting the same filesystem over NFS on the server:

[18:23]root@goliath:~# mount -t nfs -o rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576 localhost:/tank/spa_db/undo /mnt
[18:24]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec)
[18:25]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec)

During the first run I saw several nfsd threads in top, along with dd, and again zero disk I/O. There was an increase in memory usage because of the double buffering between the ARC and the buffer cache. The second run was with all of the nfsd threads totally idle, reading directly from the buffer cache.
Re: NFS server bottlenecks
On Oct 20, 2012, at 3:11 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. 
The first iozone local run finished; I'll paste just the result here, and also the same test over NFS for comparison. (This is iozone doing 8k sized IO ops, on a ZFS dataset with recordsize=8k.)

NFS:
                                                random  random
        KB  reclen   write  rewrite    read  reread    read   write
  33554432       8    4973     5522    2930    2906    2908    3886

Local:
                                                random  random
        KB  reclen   write  rewrite    read  reread    read   write
  33554432       8   34740    41390  135442  142534   24992   12493

P.S.: I forgot to mention that the network uses a 9K MTU.
Re: NFS server bottlenecks
On Oct 20, 2012, at 4:00 PM, Nikolay Denev nde...@gmail.com wrote: On Oct 20, 2012, at 3:11 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. 
The first iozone local run finished, I'll paste just the result here, and also the same test over NFS for comparison: (This is iozone doing 8k sized IO ops, on ZFS dataset with recordsize=8k)

NFS:
                                                random  random
        KB  reclen   write  rewrite    read  reread    read   write
  33554432       8    4973     5522    2930    2906    2908    3886

Local:
                                                random  random
        KB  reclen   write  rewrite    read  reread    read   write
  33554432       8   34740    41390  135442  142534   24992   12493

P.S.: I forgot to mention that the network is with 9K mtu.

Here are the full results of the test on the local fs : http://home.totalterror.net/freebsd/nfstest/local_fs/ I'm now running the same test on NFS mount over the loopback interface on the NFS server machine.
Re: NFS server bottlenecks
On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? Yes, but are in-kernel context switches also counted? I was assuming they are light-weight enough not to count. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. Another assumption - I thought UDP was the default. As I think you know, my concern with your patch would be correctness for UDP, not performance.) Yes.
Re: NFS server bottlenecks
On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? Yes, but are in-kernel context switches also counted? I was assuming they are light-weight enough not to count. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. Another assumption - I thought UDP was the default. As I think you know, my concern with your patch would be correctness for UDP, not performance.) Yes.

I've got a similar box config here, with 2x 10GB Intel NICs and 24 2TB drives on an LSI controller. I'm watching the thread patiently, kind of looking for results and answers, though I'm also tempted to run benchmarks on my system to see if I get similar results. I also considered that netmap might be an option, but I'm not quite sure it would help NFS, since it's hard to tell whether this is a network bottleneck, though it appears to be network-related.
Re: NFS server bottlenecks
On Oct 20, 2012, at 10:45 PM, Outback Dingo outbackdi...@gmail.com wrote: On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? Yes, but are in-kernel context switches also counted? I was assuming they are light-weight enough not to count. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. Another assumption - I thought UDP was the default. As I think you know, my concern with your patch would be correctness for UDP, not performance.) Yes. Ive got a similar box config here, with 2x 10GB intel nics, and 24 2TB drives on an LSI controller. Im watching the thread patiently, im kinda looking for results, and answers, Though Im also tempted to run benchmarks on my system also see if i get similar results I also considered that netmap might be one but not quite sure if it would help NFS, since its to hard to tell if its a network bottle neck, though it appears to be network related.

Doesn't look like a network issue to me. From my observations it's more like some overhead in NFS and ARC. The boxes easily push 10G with a simple iperf test. Running two iperf tests over each port of the dual-ported 10G NICs gives 960MB/sec regardless of which machine is the server. Also, I've seen over 960MB/sec over NFS with this setup, but I can't understand what type of workload was able to do this.
At some point I was able to do this with a simple dd, then after a reboot I was no longer able to push this traffic. I'm thinking something like ARC/kmem fragmentation might be the issue?
Re: NFS server bottlenecks
Outback Dingo wrote: On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? Yes, but are in-kernel context switches also counted? I was assuming they are light-weight enough not to count. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. Another assumption - I thought UDP was the default. TCP has been the default for a FreeBSD client for a long time. It was changed for the old NFS client before I became a committer. (You can explicitly set one or the other as mount options or check via wireshark/tcpdump) As I think you know, my concern with your patch would be correctness for UDP, not performance.) Yes. Ive got a similar box config here, with 2x 10GB intel nics, and 24 2TB drives on an LSI controller. Im watching the thread patiently, im kinda looking for results, and answers, Though Im also tempted to run benchmarks on my system also see if i get similar results I also considered that netmap might be one but not quite sure if it would help NFS, since its to hard to tell if its a network bottle neck, though it appears to be network related. NFS network traffic looks very different from a TCP stream (ala bit torrent or ...). I've seen this cause issues before. You can look at a packet trace in wireshark and see if TCP is retransmitting segments.
rick
Re: NFS server bottlenecks
On Oct 15, 2012, at 5:34 PM, Ivan Voras ivo...@freebsd.org wrote: On 15 October 2012 16:31, Nikolay Denev nde...@gmail.com wrote: On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. Applied and compiled OK, I will be able to test it tomorrow. Ok, thanks! The differences should be most visible in edge cases with a larger number of nfsd processes (16+) and many CPU cores. I'm now rebooting with your patch, and hopefully will have some results tomorrow.
Re: NFS server bottlenecks
On 13/10/2012 17:22, Nikolay Denev wrote: drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host. Hi, If you are already testing, could you please also test this patch: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more.
Re: NFS server bottlenecks
On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: On 13/10/2012 17:22, Nikolay Denev wrote: drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host. Hi, If you are already testing, could you please also test this patch: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. I will try to apply it to RELENG_9 as that's what I'm running and compare the results.
Re: NFS server bottlenecks
On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: On 13/10/2012 17:22, Nikolay Denev wrote: drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host. Hi, If you are already testing, could you please also test this patch: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. Applied and compiled OK, I will be able to test it tomorrow.
Re: NFS server bottlenecks
On 15 October 2012 16:31, Nikolay Denev nde...@gmail.com wrote: On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. Applied and compiled OK, I will be able to test it tomorrow. Ok, thanks! The differences should be most visible in edge cases with a larger number of nfsd processes (16+) and many CPU cores.
Re: NFS server bottlenecks
On Saturday, October 13, 2012 9:03:22 am Rick Macklem wrote: rick ps: I hope John doesn't mind being added to the cc list yet again. It's just that I suspect he knows a fair bit about mutex implementation and possible hardware cache line effects. Currently mtx_pool just uses a simple array (I have patches to force the array members to be cache-aligned, but they haven't been shown to help in any benchmarks to date). I do think though that I would prefer embedding the mutexes in the hash table entries directly. This is what we do for the turnstile and sleep queue hash tables. -- John Baldwin
Re: NFS server bottlenecks
Ivan Voras wrote: On 13/10/2012 17:22, Nikolay Denev wrote: drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host. Hi, If you are already testing, could you please also test this patch: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch I don't think (it is hard to test this) your trim cache algorithm will choose the correct entries to delete. The problem is that UDP entries very seldom time out (unless the NFS server is seeing hardly any load) and are mostly trimmed because the size exceeds the highwater mark. With your code, it will clear out all of the entries in the first hash buckets that aren't currently busy, until the total count drops below the high water mark. (If you monitor a busy server with nfsstat -e -s, you'll see the cache never goes below the high water mark, which is 500 by default.) This would delete entries of fairly recent requests. If you are going to replace the global LRU list with ones for each hash bucket, then you'll have to compare the time stamps on the least recently used entries of all the hash buckets and then delete those. If you keep the timestamp of the least recent one for that hash bucket in the hash bucket head, you could at least use that to select which bucket to delete from next, but you'll still need to:

- lock that hash bucket
- delete a few entries from that bucket's lru list
- unlock hash bucket
- repeat for various buckets until the count is below the high water mark

Or something like that. I think you'll find it a lot more work than one LRU list and one mutex. Remember that mutex isn't held for long. Btw, the code looks very nice. (If I was being a style(9) zealot, I'd remind you that it likes return (X); and not return X;.) rick It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more.
Re: NFS server bottlenecks
On 15 October 2012 22:58, Rick Macklem rmack...@uoguelph.ca wrote: The problem is that UDP entries very seldom time out (unless the NFS server is seeing hardly any load) and are mostly trimmed because the size exceeds the highwater mark. With your code, it will clear out all of the entries in the first hash buckets that aren't currently busy, until the total count drops below the high water mark. (If you monitor a busy server with nfsstat -e -s, you'll see the cache never goes below the high water mark, which is 500 by default.) This would delete entries of fairly recent requests. You are right about that; if testing by Nikolay goes reasonably well, I'll work on that. If you are going to replace the global LRU list with ones for each hash bucket, then you'll have to compare the time stamps on the least recently used entries of all the hash buckets and then delete those. If you keep the timestamp of the least recent one for that hash bucket in the hash bucket head, you could at least use that to select which bucket to delete from next, but you'll still need to:

- lock that hash bucket
- delete a few entries from that bucket's lru list
- unlock hash bucket
- repeat for various buckets until the count is below the high water mark

Ah, I think I get it: is the reliance on the high watermark as a criterion for cache expiry the reason the list is an LRU instead of an ordinary unordered list? Or something like that. I think you'll find it a lot more work than one LRU list and one mutex. Remember that mutex isn't held for long. It could be, but the current state of my code is just groundwork for the next things I have in plan:

1) Move the expiry code (the trim function) into a separate thread, run periodically (or as a callout; I'll need to talk with someone about which one is cheaper)
2) Replace the mutex with a rwlock. The only thing preventing me from doing this right away is the LRU list, since each read access modifies it (and requires a write lock).
This is why I was asking you if we can do away with the LRU algorithm.

Btw, the code looks very nice. (If I were being a style(9) zealot, I'd remind you that it likes return (X); and not return X;.)

Thanks, I'll make it more style(9) compliant as I go along.

___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: NFS server bottlenecks
Ivan Voras wrote:

On 15 October 2012 22:58, Rick Macklem rmack...@uoguelph.ca wrote:

The problem is that UDP entries very seldom time out (unless the NFS server is seeing hardly any load) and are mostly trimmed because the size exceeds the high water mark. With your code, it will clear out all of the entries in the first hash buckets that aren't currently busy, until the total count drops below the high water mark. (If you monitor a busy server with nfsstat -e -s, you'll see the cache never goes below the high water mark, which is 500 by default.) This would delete entries of fairly recent requests.

You are right about that; if testing by Nikolay goes reasonably well, I'll work on that.

If you are going to replace the global LRU list with ones for each hash bucket, then you'll have to compare the time stamps on the least recently used entries of all the hash buckets and then delete those. If you keep the timestamp of the least recent one for that hash bucket in the hash bucket head, you could at least use that to select which bucket to delete from next, but you'll still need to:
- lock that hash bucket
- delete a few entries from that bucket's lru list
- unlock hash bucket
- repeat for various buckets until the count is below the high water mark

Ah, I think I get it: is the reliance on the high water mark as a criterion for cache expiry the reason the list is an LRU instead of an ordinary unordered list?

Yes, I think you've got it;-) Have fun with it, rick

Or something like that.

I think you'll find it a lot more work than one LRU list and one mutex. Remember that mutex isn't held for long.

It could be, but the current state of my code is just groundwork for the next things I have in plan:
1) Move the expiry code (the trim function) into a separate thread, run periodically (or as a callout; I'll need to talk with someone about which one is cheaper)
2) Replace the mutex with a rwlock.
The only thing which is preventing me from doing this right away is the LRU list, since each read access modifies it (and requires a write lock).

This is why I was asking you if we can do away with the LRU algorithm.

Btw, the code looks very nice. (If I were being a style(9) zealot, I'd remind you that it likes return (X); and not return X;.)

Thanks, I'll make it more style(9) compliant as I go along.
Re: NFS server bottlenecks
Garrett Wollman wrote:

On Fri, 12 Oct 2012 22:05:54 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.

I haven't tested this at all, but I think putting all of the mutexes in an array like that is likely to cause cache-line ping-ponging. It may be better to use a pool mutex, or to put the mutexes adjacent in memory to the list heads that they protect.

Well, I'll admit I don't know how to do this. What the code does need is a set of mutexes, where any of the mutexes can be referred to by an index. I could easily define a structure that has:

struct nfsrc_hashhead {
	struct nfsrvcachehead head;
	struct mtx mutex;
} nfsrc_hashhead[NFSRVCACHE_HASHSIZE];

- but all that does is leave a small structure between each struct mtx and I wouldn't have thought that would make much difference. (How big is a typical hardware cache line these days? I have no idea.)
- I suppose I could waste space and define a glob of unused space between them, like:

struct nfsrc_hashhead {
	struct nfsrvcachehead head;
	char garbage[N];
	struct mtx mutex;
} nfsrc_hashhead[NFSRVCACHE_HASHSIZE];

- If this makes sense, how big should N be? (Somewhat less than the length of a cache line, I'd guess. It seems that the structure should be at least a cache line length in size.)

All this seems kinda hokey to me and beyond what code at this level should be worrying about, but I'm game to make changes, if others think it's appropriate. I've never used mtx_pool(9) mutexes, but it doesn't sound like they would be the right fit, from reading the man page. (Assuming mtx_pool_find() is guaranteed to return the same mutex for the same address passed in as an argument, it would seem that they would work, since I can pass &nfsrvcachehead[i] in as the pointer arg to index a mutex.)
Hopefully jhb@ can say if using mtx_pool(9) for this would be better than an array:

struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE];

Does anyone conversant with mutexes know what the best coding approach is? (But I probably won't be able to do the performance testing on any of these for a while. I have a server running the drc2 code but haven't gotten my users to put a load on it yet.) No rush. At this point, the earliest I could commit something like this to head would be December. rick

ps: I hope John doesn't mind being added to the cc list yet again. It's just that I suspect he knows a fair bit about mutex implementation and possible hardware cache line effects.

-GAWollman
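For what it's worth, the "glob of unused space" variant can be expressed without guessing N by letting the compiler pad each head/mutex pair out to a cache-line multiple. This is a hedged userspace sketch, not kernel code: struct fake_mtx and struct fake_head stand in for struct mtx and the real list head, and the 64-byte line size is an assumption (amd64's CACHE_LINE_SIZE).

```c
/*
 * Sketch: pad each hash-head + mutex pair to a cache-line boundary so
 * two adjacent locks never share a line (avoiding the ping-ponging
 * Garrett describes). The aligned attribute makes the compiler round
 * sizeof(struct nfsrc_hashhead) up to a multiple of the alignment,
 * so no hand-computed garbage[N] member is needed.
 */
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE	64		/* assumption: amd64 line size */
#define HASHSIZE	200

struct fake_mtx { void *owner; };	/* stand-in for struct mtx */
struct fake_head { void *lh_first; };	/* stand-in for the list head */

struct nfsrc_hashhead {
	struct fake_head head;
	struct fake_mtx mutex;
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct nfsrc_hashhead tbl[HASHSIZE];
```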
Re: NFS server bottlenecks
On Oct 13, 2012, at 5:05 AM, Rick Macklem rmack...@uoguelph.ca wrote:

I wrote: Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick

I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. These patches are also at:
http://people.freebsd.org/~rmacklem/drc2.patch
http://people.freebsd.org/~rmacklem/drc3.patch
in case the attachments don't get through. rick
ps: I haven't tested drc3.patch a lot, but I think it's ok?

drc3.patch applied and built cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host.

drc2.patch (but with NFSRVCACHE_HASHSIZE=500)

TEST WITH 8K - Auto Mode
Using Minimum Record Size 8 KB
Using Maximum Record Size 8 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode.
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

                                                random  random
      KB  reclen   write  rewrite    read  reread  read   write
 2097152       8    1919     1914    2356    2321  2335    1706

TEST WITH 1M - Auto Mode
Using Minimum Record Size 1024 KB
Using Maximum Record Size 1024 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode.
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

                                                random  random
      KB  reclen   write  rewrite    read  reread  read   write
 2097152    1024      73       64     477     486   496      61

drc3.patch

TEST WITH 8K - Auto Mode
Using Minimum Record Size 8 KB
Using Maximum Record Size 8 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode.
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

                                                random  random
      KB  reclen   write  rewrite    read  reread  read   write
 2097152       8    2108     2397    3001    3013  3010    2389

TEST WITH 1M - Auto Mode
Using Minimum Record Size 1024 KB
Using Maximum Record Size 1024 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode.
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 1m -q
Re: NFS server bottlenecks
I wrote: Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick

I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. These patches are also at:
http://people.freebsd.org/~rmacklem/drc2.patch
http://people.freebsd.org/~rmacklem/drc3.patch
in case the attachments don't get through. rick
ps: I haven't tested drc3.patch a lot, but I think it's ok?

--- fs/nfsserver/nfs_nfsdcache.c.orig	2012-02-29 21:07:53.0 -0500
+++ fs/nfsserver/nfs_nfsdcache.c	2012-10-03 08:23:24.0 -0400
@@ -164,8 +164,19 @@ NFSCACHEMUTEX;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
+SYSCTL_DECL(_vfs_nfsd);
+
+static int nfsrc_tcphighwater = 0;
+SYSCTL_INT(_vfs_nfsd, OID_AUTO, tcphighwater, CTLFLAG_RW,
+    &nfsrc_tcphighwater, 0,
+    "High water mark for TCP cache entries");
+static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER;
+SYSCTL_INT(_vfs_nfsd, OID_AUTO, udphighwater, CTLFLAG_RW,
+    &nfsrc_udphighwater, 0,
+    "High water mark for UDP cache entries");
+
 static int nfsrc_tcpnonidempotent = 1;
-static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER, nfsrc_udpcachesize = 0;
+static int nfsrc_udpcachesize = 0;
 static TAILQ_HEAD(, nfsrvcache) nfsrvudplru;
 static struct nfsrvhashhead nfsrvhashtbl[NFSRVCACHE_HASHSIZE],
     nfsrvudphashtbl[NFSRVCACHE_HASHSIZE];
@@ -781,8 +792,15 @@ nfsrc_trimcache(u_int64_t sockref, struc
 {
 	struct nfsrvcache *rp, *nextrp;
 	int i;
+	static time_t lasttrim = 0;
 
+	if (NFSD_MONOSEC == lasttrim &&
+	    nfsrc_tcpsavedreplies <= nfsrc_tcphighwater &&
+	    nfsrc_udpcachesize <= (nfsrc_udphighwater +
+	    nfsrc_udphighwater / 2))
+		return;
 	NFSLOCKCACHE();
+	lasttrim = NFSD_MONOSEC;
 	TAILQ_FOREACH_SAFE(rp, &nfsrvudplru, rc_lru, nextrp) {
 		if (!(rp->rc_flag & (RC_INPROG|RC_LOCKED|RC_WANTED)) &&
 		    rp->rc_refcnt == 0

--- fs/nfsserver/nfs_nfsdcache.c.sav	2012-10-10 18:56:01.0 -0400
+++ fs/nfsserver/nfs_nfsdcache.c	2012-10-12 21:04:21.0 -0400
@@ -160,7 +160,8 @@ __FBSDID("$FreeBSD: head/sys/fs/nfsserve
 #include <fs/nfs/nfsport.h>
 extern struct nfsstats newnfsstats;
-NFSCACHEMUTEX;
+extern struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE];
+extern struct mtx nfsrc_udpmtx;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
@@ -208,10 +209,11 @@ static int newnfsv2_procid[NFS_V3NPROCS]
 	NFSV2PROC_NOOP,
 };
+#define	nfsrc_hash(xid)	(((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE)
 #define	NFSRCUDPHASH(xid) \
-	(&nfsrvudphashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvudphashtbl[nfsrc_hash(xid)])
 #define	NFSRCHASH(xid) \
-	(&nfsrvhashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvhashtbl[nfsrc_hash(xid)])
 #define	TRUE	1
 #define	FALSE	0
 #define	NFSRVCACHE_CHECKLEN	100
@@ -262,6 +264,18 @@ static int nfsrc_getlenandcksum(mbuf_t m
 static void nfsrc_marksametcpconn(u_int64_t);
 /*
+ * Return the correct mutex for this cache entry.
+ */
+static __inline struct mtx *
+nfsrc_cachemutex(struct nfsrvcache *rp)
+{
+
+	if ((rp->rc_flag & RC_UDP) != 0)
+		return (&nfsrc_udpmtx);
+	return (&nfsrc_tcpmtx[nfsrc_hash(rp->rc_xid)]);
+}
+
+/*
  * Initialize the server request cache list
  */
 APPLESTATIC void
@@ -336,10 +350,12 @@ nfsrc_getudp(struct nfsrv_descript *nd,
 	struct sockaddr_in6 *saddr6;
 	struct nfsrvhashhead *hp;
 	int ret = 0;
+	struct mtx *mutex;
 
+	mutex = nfsrc_cachemutex(newrp);
 	hp = NFSRCUDPHASH(newrp->rc_xid);
 loop:
-	NFSLOCKCACHE();
+	mtx_lock(mutex);
 	LIST_FOREACH(rp, hp, rc_hash) {
 		if (newrp->rc_xid == rp->rc_xid &&
 		    newrp->rc_proc == rp->rc_proc &&
@@ -347,8 +363,8 @@ loop:
 		    nfsaddr_match(NETFAMILY(rp), &rp->rc_haddr, nd->nd_nam)) {
 			if ((rp->rc_flag & RC_LOCKED) != 0) {
 				rp->rc_flag |= RC_WANTED;
-				(void)mtx_sleep(rp, NFSCACHEMUTEXPTR,
-				    (PZERO - 1) | PDROP, "nfsrc", 10 * hz);
+				(void)mtx_sleep(rp, mutex, (PZERO - 1) | PDROP,
+				    "nfsrc", 10 * hz);
 				goto loop;
 			}
 			if (rp->rc_flag == 0)
@@ -358,14 +374,14 @@ loop:
 			TAILQ_INSERT_TAIL(&nfsrvudplru, rp, rc_lru);
 			if (rp->rc_flag & RC_INPROG) {
 				newnfsstats.srvcache_inproghits++;
-				NFSUNLOCKCACHE();
+				mtx_unlock(mutex);
 				ret = RC_DROPIT;
 			} else if (rp->rc_flag & RC_REPSTATUS) {
 				/*
 				 * V2 only.
 				 */
 				newnfsstats.srvcache_nonidemdonehits++;
-
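As a standalone illustration, the xid-to-bucket hash factored out by the patch (the nfsrc_hash() macro) can be exercised in userspace. The function below is a hypothetical re-expression of that macro for testing, assuming the shift of 24 bits is there to fold the high-order byte of the XID into the low bits before taking the modulus, so that bursts of nearby XIDs still spread across buckets.

```c
/*
 * Userspace re-expression of the patch's nfsrc_hash() macro
 * (assumption: xid is a 32-bit RPC transaction id).
 */
#include <assert.h>
#include <stdint.h>

#define NFSRVCACHE_HASHSIZE	200

static unsigned int
nfsrc_hash(uint32_t xid)
{
	/* Fold the top byte in, then reduce to a bucket index. */
	return ((xid + (xid >> 24)) % NFSRVCACHE_HASHSIZE);
}
```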
Re: NFS server bottlenecks
On Fri, 12 Oct 2012 22:05:54 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.

I haven't tested this at all, but I think putting all of the mutexes in an array like that is likely to cause cache-line ping-ponging. It may be better to use a pool mutex, or to put the mutexes adjacent in memory to the list heads that they protect. (But I probably won't be able to do the performance testing on any of these for a while. I have a server running the drc2 code but haven't gotten my users to put a load on it yet.)

-GAWollman
Re: NFS server bottlenecks
On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote:

On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote:

Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said:

Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet.

Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates?

Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)

Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrive.

My servers have 96 GB of memory so that's not a big deal for me.

This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-)

I'm not sure I see why doing it as a separate thread will improve things.
There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick

-GAWollman

My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved.

Just out of curiosity, why do you use 8K reads instead of 64K reads? Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large
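Rick's point about per-RPC overhead is easy to quantify: if the server can sustain a roughly fixed number of RPCs per second, wire throughput scales linearly with rsize. A toy calculation using the ~3000 req/s figure reported above (illustrative numbers only; bytes_per_sec is an invented helper, not anything from the NFS code):

```c
/*
 * Illustration: with per-RPC overhead dominating, throughput is
 * (sustainable RPCs/sec) * rsize. At ~3000 req/s, 8K reads move
 * about 24 MB/s; 64K reads at the same request rate would move
 * eight times that.
 */
#include <assert.h>
#include <stdint.h>

static uint64_t
bytes_per_sec(uint64_t rpcs_per_sec, uint64_t rsize)
{
	return (rpcs_per_sec * rsize);
}
```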
Re: NFS server bottlenecks
On Oct 11, 2012, at 7:20 PM, Nikolay Denev nde...@gmail.com wrote:

On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote:

On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote:

Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said:

Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet.

Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates?

Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)

Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrive.

My servers have 96 GB of memory so that's not a big deal for me.
This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. 
(For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.)

Have fun with it. Let me know when you have what you think is a good patch. rick

-GAWollman

My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved.

Just out of curiosity, why do you use 8K reads instead of 64K reads? Since the RPC overhead (including the DRC functions) is per RPC, doing
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 11, 2012, at 7:20 PM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. 
This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. 
(For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. rick -GAWollman My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved.
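The drc2.patch idea Rick describes -- nfsd threads skip the global lock and the trim entirely unless the cache has grown past a threshold -- can be modeled in a few lines. This is a hedged sketch: the class name, the highwater mark, and the "trim to half" policy are invented for illustration; the real patch trims the actual LRU/hash lists.

```python
import threading

class TrimGate:
    """Sketch: take the contended global DRC lock for trimming only when an
    unlocked peek at the entry count shows it exceeds a highwater mark."""

    def __init__(self, highwater):
        self.highwater = highwater
        self.count = 0       # cache entry count (approximate is fine for the peek)
        self.lock = threading.Lock()
        self.trims = 0       # how many times we actually trimmed

    def add_entry(self):
        self.count += 1
        # Unlocked peek: on most requests the thread never touches the lock,
        # which is what removes the mutex contention seen in profiling.
        if self.count <= self.highwater:
            return
        with self.lock:
            if self.count > self.highwater:      # re-check under the lock
                self.count = self.highwater // 2  # stand-in for trimming the LRU
                self.trims += 1
```

Under this scheme the lock is held only during the (rare) trim passes, instead of on every RPC, at the cost of letting the cache overshoot the mark briefly between the peek and the trim.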
Re: NFS server bottlenecks
Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick
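Rick's corrected arithmetic can be written down directly. A small illustrative helper (the function name is invented; the formula follows his description that readahead=N means the read itself plus N readaheads):

```python
def max_outstanding_bytes(rsize, readahead=1):
    """Bytes in flight for a sequential NFS read: the read itself plus
    'readahead' readaheads, i.e. (readahead + 1) * rsize.
    readahead defaults to 1, giving rsize * 2."""
    return (readahead + 1) * rsize
```

So with a 64K rsize, the default gives 128K in flight, readahead=15 gives rsize * 16, and readahead=16 gives rsize * 17, matching the correction above.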
Re: NFS server bottlenecks
Garrett Wollman wrote: On Tue, 9 Oct 2012 20:18:00 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most real NFS servers will be doing disk I/O. We don't always have control over what the user does. I think the worst-case for my users involves a third-party program (that they're not willing to modify) that does line-buffered writes in append mode. This uses nearly all of the CPU on per-RPC overhead (each write is three RPCs: GETATTR, WRITE, COMMIT). Yes. My comment was simply meant to imply that his testing isn't a realistic load for most NFS servers. It was not meant to imply that reducing the CPU overhead/lock contention of the DRC is a useless exercise. rick -GAWollman
Re: NFS server bottlenecks
On Tue, 9 Oct 2012 20:18:00 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most real NFS servers will be doing disk I/O. We don't always have control over what the user does. I think the worst-case for my users involves a third-party program (that they're not willing to modify) that does line-buffered writes in append mode. This uses nearly all of the CPU on per-RPC overhead (each write is three RPCs: GETATTR, WRITE, COMMIT). -GAWollman
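Garrett's worst case can be reproduced in miniature: with line buffering, every line becomes its own small write, and over NFS each such write costs roughly three RPCs (GETATTR, WRITE, COMMIT). A hedged Python sketch, where the counting sink is invented for illustration and the 3-RPCs-per-write figure comes from Garrett's description:

```python
import io

class CountingRaw(io.RawIOBase):
    """Counts write() calls; stands in for the NFS-backed file's raw store."""

    def __init__(self):
        self.writes = 0

    def writable(self):
        return True

    def write(self, b):
        self.writes += 1
        return len(b)

raw = CountingRaw()
# Line-buffered writer, as an append-mode logger would use: every newline
# forces a flush, so each line reaches the file as a separate small write.
f = io.TextIOWrapper(io.BufferedWriter(raw), line_buffering=True)
for i in range(5):
    f.write(f"log line {i}\n")

# Per the worst case above, each of those writes implies roughly
# GETATTR + WRITE + COMMIT on the wire.
rpcs = raw.writes * 3
```

Five log lines thus cost on the order of fifteen RPCs, all fixed per-RPC overhead, which is why this workload burns CPU out of proportion to the bytes written.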
Re: NFS server bottlenecks
On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: [...] My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. Just out of curiosity, why do you use 8K reads instead of 64K reads? Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large rsize/wsize values generate too large a burst of traffic for a network interface to handle and then the rsize/wsize has to be decreased to avoid this issue.) And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most real NFS servers will be doing disk I/O.
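Rick's point about per-RPC overhead is easy to quantify: fetching the same amount of data with rsize=8K issues eight times as many READ RPCs as rsize=64K, and the DRC and RPC processing cost scales with the RPC count, not with bytes. A small illustrative calculation (the helper name is invented):

```python
def read_rpcs(file_size, rsize):
    """READ RPCs needed to fetch file_size bytes at a given rsize
    (ceiling division; each RPC pays the same fixed DRC/RPC overhead)."""
    return -(-file_size // rsize)

GIB = 1 << 30  # one gigabyte of reads, as a round example
```

Reading 1 GiB at 8K takes 131072 READ RPCs versus 16384 at 64K, so at the roughly 3000-RPCs/sec ceiling observed above, the 8K run spends 8x longer in per-RPC code for the same data.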
Re: NFS server bottlenecks
On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: [...] My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. I've snatched some sample DTrace script from the net : [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ] And modified it for our new NFS server :

#!/usr/sbin/dtrace -qs

fbt:kernel:nfsrvd_*:entry
{
    self->ts = timestamp;
    @counts[probefunc] = count();
}

fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
    this->delta = (timestamp - self->ts) / 100;
}

fbt:kernel:nfsrvd_*:return
/ self->ts > 0 && this->delta > 100 /
{
    @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
}
Re: NFS server bottlenecks
On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. 
The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time contending on the mutex (it will be held less frequently and for shorter periods).

I think the little drc2.patch, which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100 RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recall someone recently trying to run FreeBSD on an i486, although I doubt they wanted to run the nfsd on it.)

The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time?

With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.)

Have fun with it. Let me know when you have what you think is a good patch.
rick

I was doing some NFS testing with a RELENG_9 machine and a Linux RHEL machine over a 10G network, and noticed the same nfsd threads issue. Previously I would read a 32G file locally on the FreeBSD ZFS/NFS server with dd if=/tank/32G.bin of=/dev/null bs=1M to cache it completely in ARC (the machine has 196G RAM); if I then do this again locally I would get close to 4GB/sec read - completely from the cache... But if I try to read the file over NFS from the Linux machine I would only get about 100MB/sec, sometimes a bit more, and all of the nfsd threads are clearly visible in top. pmcstat also showed the same mutex contention as in the original post. I've now applied the drc2 patch, and rerunning the same test yields about 960MB/s transfer over NFS… quite an
Re: NFS server bottlenecks
Garrett Wollman wrote: [Adding freebsd-fs@ to the Cc list, which I neglected the first time around...] On Tue, 2 Oct 2012 08:28:29 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said:

I can't remember (I am early retired now;-) if I mentioned this patch before: http://people.freebsd.org/~rmacklem/drc.patch It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can be twiddled so that the drc is trimmed less frequently. By making these values larger, the trim will only happen once/sec until the high water mark is reached, instead of on every RPC. The tradeoff is that the DRC will become larger, but given memory sizes these days, that may be fine for you.

It will be a while before I have another server that isn't in production (it's on my deployment plan, but getting the production servers going is taking first priority). The approaches that I was going to look at:

Simplest: only do the cache trim once every N requests (for some reasonable value of N, e.g., 1000). Maybe keep track of the number of entries in each hash bucket and ignore those buckets that only have one entry even if it is stale.

Well, the patch I have does it when it gets too big. This made sense to me, since the cache is trimmed to keep it from getting too large. It also does the trim at least once/sec, so that really stale entries are removed.

Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet.

Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? A mutex in each element could be used for changes (not insertion/removal) to an individual element.
However, the current code manipulates the lists and makes minimal changes to the individual elements, so I'm not sure if a mutex in each element would be useful or not, but it wouldn't help for the trimming case, imho.

I modified the patch slightly, so it doesn't bother to acquire the mutex when it is checking if it should trim now. I think this results in a slight risk that the test will use an out-of-date cached copy of one of the global vars, but since the code isn't modifying them, I don't think it matters. This modified patch is attached and is also here: http://people.freebsd.org/~rmacklem/drc2.patch

Moderately complicated: figure out if a different synchronization type can safely be used (e.g., rmlock instead of mutex) and do so.

More complicated: move all cache trimming to a separate thread and just have the rest of the code wake it up when the cache is getting too big (or just once a second since that's easy to implement). Maybe just move all cache processing to a separate thread.

Only doing it once/sec would result in a very large cache when bursts of traffic arrive. The above patch does it when it is too big or at least once/sec. I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time?

Isilon did use separate threads (I never saw their code, so I am going by what they told me), but it sounded to me like they were trimming the cache too aggressively to be effective for TCP mounts. (i.e.
It sounded to me like they had broken the algorithm to achieve better perf.) Remember that the DRC is weird, in that it is a cache to improve correctness at the expense of overhead. It never improves performance. On the other hand, turn it off or throw away entries too aggressively and data corruption, due to retries of non-idempotent operations, can be the outcome.

Good luck with whatever you choose, rick

It's pretty clear from the profile that the cache mutex is heavily contended, so anything that reduces the length of time it's held is probably a win. That URL again, for the benefit of people on freebsd-fs who didn't see it on hackers, is: http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.png. (This graph is slightly modified from my previous post as I removed some spurious edges to make the formatting look better. Still looking for a way to get a profile that includes all kernel modules with the kernel.)

-GAWollman
Re: NFS server bottlenecks
Garrett Wollman wrote: I had an email conversation with Rick Macklem about six months ago about NFS server bottlenecks. I'm now in a position to observe my large-scale NFS server under an actual production load, so I thought I would update folks on what it looks like. This is a 9.1 prerelease kernel (I hope 9.1 will be released soon as I have four more of these servers to deploy!). When under nearly 100% load on an 8-core (16-thread) Quanta QSSC-S99Q storage server, with a 10G network interface, pmcstat tells me this:

PMC: [INST_RETIRED.ANY_P] Samples: 2727105 (100.0%), 27 unresolved

%SAMP IMAGE    FUNCTION            CALLERS
 29.3 kernel   _mtx_lock_sleep     nfsrvd_updatecache:10.0 nfsrvd_getcache:7.4 ...
  9.5 kernel   cpu_search_highest  cpu_search_highest:8.1 sched_idletd:1.4
  7.4 zfs.ko   lzjb_decompress     zio_decompress
  4.3 kernel   _mtx_lock_spin      turnstile_trywait:2.2 pmclog_reserve:1.0 ...
  4.0 zfs.ko   fletcher_4_native   zio_checksum_error:3.1 zio_checksum_compute:0.8
  3.6 kernel   cpu_search_lowest   cpu_search_lowest
  3.3 kernel   nfsrc_trimcache     nfsrvd_getcache:1.6 nfsrvd_updatecache:1.6
  2.3 kernel   ipfw_chk            ipfw_check_hook
  2.1 pmcstat  _init
  1.1 kernel   _sx_xunlock
  0.9 kernel   _sx_xlock
  0.9 kernel   spinlock_exit

This does seem to confirm my original impression that the NFS replay cache is quite expensive. Running a gprof(1) analysis on the same PMC data reveals a bit more detail (I've removed some uninteresting parts of the call graph):

I can't remember (I am early retired now;-) if I mentioned this patch before: http://people.freebsd.org/~rmacklem/drc.patch It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can be twiddled so that the drc is trimmed less frequently. By making these values larger, the trim will only happen once/sec until the high water mark is reached, instead of on every RPC. The tradeoff is that the DRC will become larger, but given memory sizes these days, that may be fine for you.
jwd@ was going to test it, but he moved to a different job away from NFS, so the patch has just been collecting dust. If you could test it, that would be nice, rick

ps: Also, the current patch still locks before checking if it needs to do the trim. I think that could safely be changed so that it doesn't lock/unlock when it isn't doing the trim, if that makes a significant difference.

                    called/total      parents
index  %time   self  descendents  called+self      name          index
                    called/total      children

               4881.00  2004642.70   932627/932627     svc_run_internal [2]
[4]     45.1   4881.00  2004642.70   932627            nfssvc_program [4]
              13199.00   504436.33   584319/584319     nfsrvd_updatecache [9]
              23075.00   403396.18   468009/468009     nfsrvd_getcache [14]
               1032.25   416249.44     2239/2284       svc_sendreply_mbuf [15]
               6168.00   381770.44    11618/11618      nfsrvd_dorpc [24]
               3526.87    86869.88   112478/112514     nfsrvd_sentcache [74]
                890.00    50540.89     4252/4252       svc_getcred [101]
              14876.60    32394.26     4177/24500      crfree <cycle 3> [263]
              11550.11    25150.73     3243/24500      free <cycle 3> [102]
               1348.88    15451.66     2716/16831      m_freem [59]
               4066.61      216.81     1434/1456       svc_freereq [321]
               2342.15      677.40      557/1459       malloc_type_freed [265]
                 59.14     1916.84      134/2941       crget [113]
               1602.25        0.00      322/9682       bzero [105]
                690.93        0.00       43/44         getmicrotime [571]
                287.22        7.33      138/1205       prison_free [384]
                233.61        0.00       60/798        PHYS_TO_VM_PAGE [358]
                203.12        0.00       94/230        nfsrv_mallocmget_limit [632]
                151.76        0.00       51/1723       pmap_kextract [309]
                  0.78       70.28        9/3281       _mtx_unlock_sleep [154]
                 19.22       16.88       38/400403     nfsrc_trimcache [26]
                 11.05       21.74        7/197        crsetgroups [532]
                 30.37        0.00       11/6592       critical_enter [190]
                 25.50        0.00        9/36         turnstile_chain_unlock [844]
                 24.86        0.00        3/7          nfsd_errmap [913]
                 12.36        8.57        8/2145       in_cksum_skip [298]
                  9.10        3.59        5/12455      mb_free_ext [140]
                  1.84        4.85        2/2202       VOP_UNLOCK_APV [269]
-----------------------------------------------
                  0.49        0.15        1/1129009    uhub_explore [1581]
                  0.49        0.15        1/1129009    tcp_output [10]
                  0.49        0.15        1/1129009    pmap_remove_all [1141]
                  0.49        0.15        1/1129009    vm_map_insert [236]
                  0.49        0.15        1/1129009    vnode_create_vobject [281]
                  0.49        0.15        1/1129009    biodone [351]
                  0.49        0.15        1/1129009    vm_object_madvise [670]
                  0.49        0.15        1/1129009    xpt_done [483]
                  0.49        0.15        1/1129009    vputx [80]
                  0.49        0.15        1/1129009    vm_map_delete <cycle 3> [49]
                  0.49        0.15        1/1129009    vm_object_deallocate <cycle 3> [356]
                  0.49        0.15        1/1129009    vm_page_unwire [338]
                  0.49        0.15        1/1129009    pmap_change_wiring [318]
                  0.98        0.31        2/1129009    getnewvnode [227]
                  0.98        0.31        2/1129009    pmap_clear_reference [1004]
                  0.98        0.31        2/1129009    usbd_do_request_flags [1282]
                  0.98        0.31        2/1129009    vm_object_collapse <cycle 3> [587]
                  0.98        0.31        2/1129009    vm_object_page_remove [122]
                  1.48        0.46        3/1129009    mpt_pci_intr [487]
                  1.48        0.46        3/1129009    pmap_extract [355]
                  1.48        0.46        3/1129009    vm_fault_unwire [171]
                  1.97        0.62        4/1129009    vgonel [270]
                  1.97        0.62        4/1129009    vm_object_shadow [926]
                  1.97        0.62        4/1129009    zone_alloc_item [434]
                  2.46        0.77        5/1129009    vnlru_free [235]
                  2.46        0.77        5/1129009    insmntque1 [737]
                  2.95        0.93        6/1129009    zone_free_item [409]
                  3.94        1.24