Re: NFS server bottlenecks
Ivan Voras wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. Oh, I realized that, if you are testing 9/stable (and not head), that you won't have r227809. Without that, all reads on a given file will be serialized, because the server will acquire an exclusive lock on the vnode. The patch for r227809 in head is at: http://people.freebsd.org/~rmacklem/lkshared.patch This should apply fine to a 9 system (but not 8.n), I think. 
Good luck with it and have fun, rick ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: NFS server bottlenecks
On Oct 23, 2012, at 2:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. Oh, I realized that, if you are testing 9/stable (and not head), that you won't have r227809. Without that, all reads on a given file will be serialized, because the server will acquire an exclusive lock on the vnode. The patch for r227809 in head is at: http://people.freebsd.org/~rmacklem/lkshared.patch This should apply fine to a 9 system (but not 8.n), I think. 
Good luck with it and have fun, rick

Thanks, I've applied the patch by hand because of some differences and I'm now rebuilding. In case they are still needed, here are the dd tests with a loopback UDP mount: http://home.totalterror.net/freebsd/nfstest/udp-dd.html Over UDP, writing degrades much more severely...
Re: NFS server bottlenecks
On Oct 18, 2012, at 6:11 PM, Nikolay Denev nde...@gmail.com wrote: On Oct 15, 2012, at 5:34 PM, Ivan Voras ivo...@freebsd.org wrote: On 15 October 2012 16:31, Nikolay Denev nde...@gmail.com wrote: On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. Applied and compiled OK, I will be able to test it tomorrow. Ok, thanks! The differences should be most visible in edge cases with a larger number of nfsd processes (16+) and many CPU cores. I'm now rebooting with your patch, and hopefully will have some results tomorrow. Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch.
Re: NFS server bottlenecks
On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few other things I'm interested in: why does your load average spike almost to the 20s, and how come that with 24 drives in RAID-10 you only push 600 MBit/s through the 10 GBit/s Ethernet? Have you tested your drive setup locally (AESNI shouldn't be a bottleneck; you should be able to encrypt well into the GByte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious how it would perform on your machine.
Re: NFS server bottlenecks
Ivan Voras wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. As I think you know, my concern with your patch would be correctness for UDP, not performance.) Anyhow, sounds like you guys are having fun with it and learning some useful things. Keep up the good work, rick But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. 
Re: NFS server bottlenecks
On Oct 20, 2012, at 3:11 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine.

I've now started this test locally. From previous (different) iozone runs I remember the speed was much better locally, but I will wait for this test to finish, as the comparison will be better. But I think there is still something fishy… I have cases where I have reached 1000 MB/s over NFS (from network stats, not local machine stats), but sometimes it is very slow even for a file that is completely in the ARC.
Rick mentioned that this could be due to RPC overhead and network round trip time, but earlier in this thread I had done a test only on the server, by mounting the NFS-exported ZFS dataset locally and running some tests with dd. To take the network out of the equation I redid the test by mounting the same filesystem over NFS on the server:

[18:23]root@goliath:~# mount -t nfs -o rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576 localhost:/tank/spa_db/undo /mnt
[18:24]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec)
[18:25]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec)

During the first run I saw several nfsd threads in top, along with dd, and again zero disk I/O. There was an increase in memory usage because of the double buffering between the ARC and the buffer cache. The second run was with all of the nfsd threads totally idle, reading directly from the buffer cache.
Re: NFS server bottlenecks
On Oct 20, 2012, at 3:11 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. 
The first iozone local run finished; I'll paste just the result here, and also the same test over NFS for comparison. (This is iozone doing 8k sized IO ops, on a ZFS dataset with recordsize=8k.)

NFS:
                                                random  random
        KB  reclen   write  rewrite    read  reread    read   write
  33554432       8    4973     5522    2930    2906    2908    3886

Local:
                                                random  random
        KB  reclen   write  rewrite    read  reread    read   write
  33554432       8   34740    41390  135442  142534   24992   12493

P.S.: I forgot to mention that the network uses a 9K MTU.
Re: NFS server bottlenecks
On Oct 20, 2012, at 4:00 PM, Nikolay Denev nde...@gmail.com wrote: On Oct 20, 2012, at 3:11 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 13:42, Nikolay Denev nde...@gmail.com wrote: Here are the results from testing both patches : http://home.totalterror.net/freebsd/nfstest/results.html Both tests ran for about 14 hours ( a bit too much, but I wanted to compare different zfs recordsize settings ), and were done first after a fresh reboot. The only noticeable difference seems to be much more context switches with Ivan's patch. Thank you very much for your extensive testing! I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. But, you have also shown that my patch doesn't do any better than Rick's even on a fairly large configuration, so I don't think there's value in adding the extra complexity, and Rick knows NFS much better than I do. But there are a few things other than that I'm interested in: like why does your load average spike almost to 20-ties, and how come that with 24 drives in RAID-10 you only push through 600 MBit/s through the 10 GBit/s Ethernet. Have you tested your drive setup locally (AESNI shouldn't be a bottleneck, you should be able to encrypt well into Gbyte/s range) and the network? If you have the time, could you repeat the tests but with a recent Samba server and a CIFS mount on the client side? This is probably not important, but I'm just curious of how would it perform on your machine. 
The first iozone local run finished, I'll paste just the result here, and also the same test over NFS for comparison: (This is iozone doing 8k sized IO ops, on ZFS dataset with recordsize=8k)

NFS:
                                                random  random
        KB  reclen   write  rewrite    read  reread    read   write
  33554432       8    4973     5522    2930    2906    2908    3886

Local:
                                                random  random
        KB  reclen   write  rewrite    read  reread    read   write
  33554432       8   34740    41390  135442  142534   24992   12493

P.S.: I forgot to mention that the network is with 9K mtu.

Here are the full results of the test on the local fs : http://home.totalterror.net/freebsd/nfstest/local_fs/ I'm now running the same test on NFS mount over the loopback interface on the NFS server machine.
Re: NFS server bottlenecks
On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? Yes, but are in-kernel context switches also counted? I was assuming they are light-weight enough not to count. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. Another assumption - I thought UDP was the default. As I think you know, my concern with your patch would be correctness for UDP, not performance.) Yes.
Re: NFS server bottlenecks
On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? Yes, but are in-kernel context switches also counted? I was assuming they are light-weight enough not to count. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. Another assumption - I thought UDP was the default. As I think you know, my concern with your patch would be correctness for UDP, not performance.) Yes.

I've got a similar box config here, with 2x 10GB Intel NICs and 24 2TB drives on an LSI controller. I'm watching the thread patiently, kind of looking for results and answers, though I'm also tempted to run benchmarks on my system to see if I get similar results. I also considered that netmap might be an option, but I'm not quite sure it would help NFS, since it's hard to tell whether this is a network bottleneck, though it appears to be network-related.
Re: NFS server bottlenecks
On Oct 20, 2012, at 10:45 PM, Outback Dingo outbackdi...@gmail.com wrote: On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? Yes, but are in-kernel context switches also counted? I was assuming they are light-weight enough not to count. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. Another assumption - I thought UDP was the default. As I think you know, my concern with your patch would be correctness for UDP, not performance.) Yes. Ive got a similar box config here, with 2x 10GB intel nics, and 24 2TB drives on an LSI controller. Im watching the thread patiently, im kinda looking for results, and answers, Though Im also tempted to run benchmarks on my system also see if i get similar results I also considered that netmap might be one but not quite sure if it would help NFS, since its to hard to tell if its a network bottle neck, though it appears to be network related.

Doesn't look like a network issue to me. From my observations it's more like some overhead in NFS and ARC. The boxes easily push 10G with a simple iperf test. Running two iperf tests over each port of the dual-ported 10G NICs gives 960MB/sec regardless of which machine is the server. Also, I've seen over 960MB/sec over NFS with this setup, but I can't understand what type of workload was able to do this.
At some point I was able to do this with a simple dd, then after a reboot I was no longer able to push this traffic. I'm thinking something like ARC/kmem fragmentation might be the issue?
Re: NFS server bottlenecks
Outback Dingo wrote: On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras ivo...@freebsd.org wrote: On 20 October 2012 14:45, Rick Macklem rmack...@uoguelph.ca wrote: Ivan Voras wrote: I don't know how to interpret the rise in context switches; as this is kernel code, I'd expect no context switches. I hope someone else can explain. Don't the mtx_lock() calls spin for a little while and then context switch if another thread still has it locked? Yes, but are in-kernel context switches also counted? I was assuming they are light-weight enough not to count. Hmm, I didn't look, but were there any tests using UDP mounts? (I would have thought that your patch would mainly affect UDP mounts, since that is when my version still has the single LRU queue/mutex. Another assumption - I thought UDP was the default. TCP has been the default for a FreeBSD client for a long time. It was changed for the old NFS client before I became a committer. (You can explicitly set one or the other as mount options or check via wireshark/tcpdump) As I think you know, my concern with your patch would be correctness for UDP, not performance.) Yes. Ive got a similar box config here, with 2x 10GB intel nics, and 24 2TB drives on an LSI controller. Im watching the thread patiently, im kinda looking for results, and answers, Though Im also tempted to run benchmarks on my system also see if i get similar results I also considered that netmap might be one but not quite sure if it would help NFS, since its to hard to tell if its a network bottle neck, though it appears to be network related. NFS network traffic looks very different from a TCP stream (ala bit torrent or ...). I've seen this cause issues before. You can look at a packet trace in wireshark and see if TCP is retransmitting segments.
rick
Re: NFS server bottlenecks
On Oct 15, 2012, at 5:34 PM, Ivan Voras ivo...@freebsd.org wrote: On 15 October 2012 16:31, Nikolay Denev nde...@gmail.com wrote: On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. Applied and compiled OK, I will be able to test it tomorrow. Ok, thanks! The differences should be most visible in edge cases with a larger number of nfsd processes (16+) and many CPU cores. I'm now rebooting with your patch, and hopefully will have some results tomorrow.
Re: NFS server bottlenecks
On 13/10/2012 17:22, Nikolay Denev wrote: drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host. Hi, If you are already testing, could you please also test this patch: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more.
Re: NFS server bottlenecks
On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: On 13/10/2012 17:22, Nikolay Denev wrote: drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host. Hi, If you are already testing, could you please also test this patch: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. I will try to apply it to RELENG_9 as that's what I'm running and compare the results.
Re: NFS server bottlenecks
On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: On 13/10/2012 17:22, Nikolay Denev wrote: drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host. Hi, If you are already testing, could you please also test this patch: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. Applied and compiled OK, I will be able to test it tomorrow.
Re: NFS server bottlenecks
On 15 October 2012 16:31, Nikolay Denev nde...@gmail.com wrote: On Oct 15, 2012, at 2:52 PM, Ivan Voras ivo...@freebsd.org wrote: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more. Applied and compiled OK, I will be able to test it tomorrow. Ok, thanks! The differences should be most visible in edge cases with a larger number of nfsd processes (16+) and many CPU cores.
Re: NFS server bottlenecks
On Saturday, October 13, 2012 9:03:22 am Rick Macklem wrote: rick ps: I hope John doesn't mind being added to the cc list yet again. It's just that I suspect he knows a fair bit about mutex implementation and possible hardware cache line effects. Currently mtx_pool just uses a simple array (I have patches to force the array members to be cache-aligned, but they haven't been shown to help in any benchmarks to date). I do think though that I would prefer embedding the mutexes in the hash table entries directly. This is what we do for the turnstile and sleep queue hash tables. -- John Baldwin
Re: NFS server bottlenecks
Ivan Voras wrote: On 13/10/2012 17:22, Nikolay Denev wrote: drc3.patch applied and build cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host. Hi, If you are already testing, could you please also test this patch: http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch I don't think (it is hard to test this) your trim cache algorithm will choose the correct entries to delete. The problem is that UDP entries very seldom time out (unless the NFS server is seeing hardly any load) and are mostly trimmed because the size exceeds the highwater mark. With your code, it will clear out all of the entries in the first hash buckets that aren't currently busy, until the total count drops below the high water mark. (If you monitor a busy server with nfsstat -e -s, you'll see the cache never goes below the high water mark, which is 500 by default.) This would delete entries of fairly recent requests. If you are going to replace the global LRU list with ones for each hash bucket, then you'll have to compare the time stamps on the least recently used entries of all the hash buckets and then delete those. If you keep the timestamp of the least recent one for that hash bucket in the hash bucket head, you could at least use that to select which bucket to delete from next, but you'll still need to:

- lock that hash bucket
- delete a few entries from that bucket's lru list
- unlock hash bucket
- repeat for various buckets until the count is below the high water mark

Or something like that. I think you'll find it a lot more work than one LRU list and one mutex. Remember that mutex isn't held for long. Btw, the code looks very nice. (If I was being a style(9) zealot, I'd remind you that it likes return (X); and not return X;.) rick It should apply to HEAD without Rick's patches. It's a bit different approach than Rick's, breaking down locks even more.
Re: NFS server bottlenecks
On 15 October 2012 22:58, Rick Macklem rmack...@uoguelph.ca wrote: The problem is that UDP entries very seldom time out (unless the NFS server is seeing hardly any load) and are mostly trimmed because the size exceeds the highwater mark. With your code, it will clear out all of the entries in the first hash buckets that aren't currently busy, until the total count drops below the high water mark. (If you monitor a busy server with nfsstat -e -s, you'll see the cache never goes below the high water mark, which is 500 by default.) This would delete entries of fairly recent requests. You are right about that; if testing by Nikolay goes reasonably well, I'll work on that. If you are going to replace the global LRU list with ones for each hash bucket, then you'll have to compare the time stamps on the least recently used entries of all the hash buckets and then delete those. If you keep the timestamp of the least recent one for that hash bucket in the hash bucket head, you could at least use that to select which bucket to delete from next, but you'll still need to:

- lock that hash bucket
- delete a few entries from that bucket's lru list
- unlock hash bucket
- repeat for various buckets until the count is below the high water mark

Ah, I think I get it: is the reliance on the high watermark as a criterion for cache expiry the reason the list is an LRU instead of an ordinary unordered list? Or something like that. I think you'll find it a lot more work than one LRU list and one mutex. Remember that mutex isn't held for long. It could be, but the current state of my code is just groundwork for the next things I have in plan:

1) Move the expiry code (the trim function) into a separate thread, run periodically (or as a callout; I'll need to talk with someone about which one is cheaper)
2) Replace the mutex with a rwlock. The only thing preventing me from doing this right away is the LRU list, since each read access modifies it (and requires a write lock).
This is why I was asking you if we can do away with the LRU algorithm.

Btw, the code looks very nice. (If I were being a style(9) zealot, I'd remind you that it likes return (X); and not return X;.)

Thanks, I'll make it more style(9) compliant as I go along.

___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: NFS server bottlenecks
Ivan Voras wrote:

On 15 October 2012 22:58, Rick Macklem rmack...@uoguelph.ca wrote:

The problem is that UDP entries very seldom time out (unless the NFS server is seeing hardly any load) and are mostly trimmed because the size exceeds the high water mark. With your code, it will clear out all of the entries in the first hash buckets that aren't currently busy, until the total count drops below the high water mark. (If you monitor a busy server with nfsstat -e -s, you'll see the cache never goes below the high water mark, which is 500 by default.) This would delete entries of fairly recent requests.

You are right about that; if testing by Nikolay goes reasonably well, I'll work on that.

If you are going to replace the global LRU list with ones for each hash bucket, then you'll have to compare the time stamps on the least recently used entries of all the hash buckets and then delete those. If you keep the timestamp of the least recent one for that hash bucket in the hash bucket head, you could at least use that to select which bucket to delete from next, but you'll still need to:
- lock that hash bucket
- delete a few entries from that bucket's lru list
- unlock hash bucket
- repeat for various buckets until the count is below the high water mark

Ah, I think I get it: is the reliance on the high water mark as a criterion for cache expiry the reason the list is an LRU instead of an ordinary unordered list?

Yes, I think you've got it;-) Have fun with it, rick

Or something like that.

I think you'll find it a lot more work than one LRU list and one mutex. Remember that mutex isn't held for long.

It could be, but the current state of my code is just groundwork for the next things I have in plan:
1) Move the expiry code (the trim function) into a separate thread, run periodically (or as a callout; I'll need to talk with someone about which one is cheaper)
2) Replace the mutex with a rwlock.
The only thing which is preventing me from doing this right away is the LRU list, since each read access modifies it (and requires a write lock).

This is why I was asking you if we can do away with the LRU algorithm.

Btw, the code looks very nice. (If I were being a style(9) zealot, I'd remind you that it likes return (X); and not return X;.)

Thanks, I'll make it more style(9) compliant as I go along.
Re: NFS server bottlenecks
Garrett Wollman wrote:

On Fri, 12 Oct 2012 22:05:54 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.

I haven't tested this at all, but I think putting all of the mutexes in an array like that is likely to cause cache-line ping-ponging. It may be better to use a pool mutex, or to put the mutexes adjacent in memory to the list heads that they protect.

Well, I'll admit I don't know how to do this. What the code does need is a set of mutexes, where any of the mutexes can be referred to by an index. I could easily define a structure that has:

struct nfsrc_hashhead {
	struct nfsrvcachehead head;
	struct mtx mutex;
} nfsrc_hashhead[NFSRVCACHE_HASHSIZE];

- but all that does is leave a small structure between each struct mtx and I wouldn't have thought that would make much difference. (How big is a typical hardware cache line these days? I have no idea.)
- I suppose I could waste space and define a glob of unused space between them, like:

struct nfsrc_hashhead {
	struct nfsrvcachehead head;
	char garbage[N];
	struct mtx mutex;
} nfsrc_hashhead[NFSRVCACHE_HASHSIZE];

- If this makes sense, how big should N be? (Somewhat less than the length of a cache line, I'd guess. It seems that the structure should be at least a cache line length in size.)

All this seems kinda hokey to me and beyond what code at this level should be worrying about, but I'm game to make changes, if others think it's appropriate. I've never used mtx_pool(9) mutexes, but it doesn't sound like they would be the right fit, from reading the man page. (Assuming mtx_pool_find() is guaranteed to return the same mutex for the same address passed in as an argument, it would seem that they would work, since I can pass &nfsrvcachehead[i] in as the pointer arg to index a mutex.)
Hopefully jhb@ can say if using mtx_pool(9) for this would be better than an array:

struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE];

Does anyone conversant with mutexes know what the best coding approach is? (But I probably won't be able to do the performance testing on any of these for a while. I have a server running the drc2 code but haven't gotten my users to put a load on it yet.) No rush. At this point, the earliest I could commit something like this to head would be December. rick

ps: I hope John doesn't mind being added to the cc list yet again. It's just that I suspect he knows a fair bit about mutex implementation and possible hardware cache line effects.

-GAWollman
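For what it's worth, the "glob of unused space" variant can be expressed without guessing N by letting the compiler pad each head/mutex pair out to a cache-line multiple. This is a hedged userspace sketch, not kernel code: struct fake_mtx and struct fake_head stand in for struct mtx and the real list head, and the 64-byte line size is an assumption (amd64's CACHE_LINE_SIZE).

```c
/*
 * Sketch: pad each hash-head + mutex pair to a cache-line boundary so
 * two adjacent locks never share a line (avoiding the ping-ponging
 * Garrett describes). The aligned attribute makes the compiler round
 * sizeof(struct nfsrc_hashhead) up to a multiple of the alignment,
 * so no hand-computed garbage[N] member is needed.
 */
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE	64		/* assumption: amd64 line size */
#define HASHSIZE	200

struct fake_mtx { void *owner; };	/* stand-in for struct mtx */
struct fake_head { void *lh_first; };	/* stand-in for the list head */

struct nfsrc_hashhead {
	struct fake_head head;
	struct fake_mtx mutex;
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct nfsrc_hashhead tbl[HASHSIZE];
```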
Re: NFS server bottlenecks
On Oct 13, 2012, at 5:05 AM, Rick Macklem rmack...@uoguelph.ca wrote:

I wrote: Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick

I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. These patches are also at:
http://people.freebsd.org/~rmacklem/drc2.patch
http://people.freebsd.org/~rmacklem/drc3.patch
in case the attachments don't get through. rick
ps: I haven't tested drc3.patch a lot, but I think it's ok?

drc3.patch applied and built cleanly and shows nice improvement! I've done a quick benchmark using iozone over the NFS mount from the Linux host.

drc2.patch (but with NFSRVCACHE_HASHSIZE=500)

TEST WITH 8K - Auto Mode
Using Minimum Record Size 8 KB
Using Maximum Record Size 8 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode.
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

                                                random  random
      KB  reclen   write  rewrite    read  reread  read   write
 2097152       8    1919     1914    2356    2321  2335    1706

TEST WITH 1M - Auto Mode
Using Minimum Record Size 1024 KB
Using Maximum Record Size 1024 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode.
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

                                                random  random
      KB  reclen   write  rewrite    read  reread  read   write
 2097152    1024      73       64     477     486   496      61

drc3.patch

TEST WITH 8K - Auto Mode
Using Minimum Record Size 8 KB
Using Maximum Record Size 8 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode.
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

                                                random  random
      KB  reclen   write  rewrite    read  reread  read   write
 2097152       8    2108     2397    3001    3013  3010    2389

TEST WITH 1M - Auto Mode
Using Minimum Record Size 1024 KB
Using Maximum Record Size 1024 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode.
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 1m -q
Re: NFS server bottlenecks
I wrote: Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick

I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200. These patches are also at:
http://people.freebsd.org/~rmacklem/drc2.patch
http://people.freebsd.org/~rmacklem/drc3.patch
in case the attachments don't get through. rick
ps: I haven't tested drc3.patch a lot, but I think it's ok?

--- fs/nfsserver/nfs_nfsdcache.c.orig	2012-02-29 21:07:53.0 -0500
+++ fs/nfsserver/nfs_nfsdcache.c	2012-10-03 08:23:24.0 -0400
@@ -164,8 +164,19 @@ NFSCACHEMUTEX;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
+SYSCTL_DECL(_vfs_nfsd);
+
+static int nfsrc_tcphighwater = 0;
+SYSCTL_INT(_vfs_nfsd, OID_AUTO, tcphighwater, CTLFLAG_RW,
+    &nfsrc_tcphighwater, 0,
+    "High water mark for TCP cache entries");
+static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER;
+SYSCTL_INT(_vfs_nfsd, OID_AUTO, udphighwater, CTLFLAG_RW,
+    &nfsrc_udphighwater, 0,
+    "High water mark for UDP cache entries");
+
 static int nfsrc_tcpnonidempotent = 1;
-static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER, nfsrc_udpcachesize = 0;
+static int nfsrc_udpcachesize = 0;
 static TAILQ_HEAD(, nfsrvcache) nfsrvudplru;
 static struct nfsrvhashhead nfsrvhashtbl[NFSRVCACHE_HASHSIZE],
     nfsrvudphashtbl[NFSRVCACHE_HASHSIZE];
@@ -781,8 +792,15 @@ nfsrc_trimcache(u_int64_t sockref, struc
 {
 	struct nfsrvcache *rp, *nextrp;
 	int i;
+	static time_t lasttrim = 0;
 
+	if (NFSD_MONOSEC == lasttrim &&
+	    nfsrc_tcpsavedreplies <= nfsrc_tcphighwater &&
+	    nfsrc_udpcachesize <= (nfsrc_udphighwater +
+	    nfsrc_udphighwater / 2))
+		return;
 	NFSLOCKCACHE();
+	lasttrim = NFSD_MONOSEC;
 	TAILQ_FOREACH_SAFE(rp, &nfsrvudplru, rc_lru, nextrp) {
 		if (!(rp->rc_flag & (RC_INPROG|RC_LOCKED|RC_WANTED)) &&
 		    rp->rc_refcnt == 0

--- fs/nfsserver/nfs_nfsdcache.c.sav	2012-10-10 18:56:01.0 -0400
+++ fs/nfsserver/nfs_nfsdcache.c	2012-10-12 21:04:21.0 -0400
@@ -160,7 +160,8 @@ __FBSDID("$FreeBSD: head/sys/fs/nfsserve
 #include <fs/nfs/nfsport.h>
 extern struct nfsstats newnfsstats;
-NFSCACHEMUTEX;
+extern struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE];
+extern struct mtx nfsrc_udpmtx;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
@@ -208,10 +209,11 @@ static int newnfsv2_procid[NFS_V3NPROCS]
 	NFSV2PROC_NOOP,
 };
+#define	nfsrc_hash(xid)	(((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE)
 #define	NFSRCUDPHASH(xid) \
-	(&nfsrvudphashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvudphashtbl[nfsrc_hash(xid)])
 #define	NFSRCHASH(xid) \
-	(&nfsrvhashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvhashtbl[nfsrc_hash(xid)])
 #define	TRUE	1
 #define	FALSE	0
 #define	NFSRVCACHE_CHECKLEN	100
@@ -262,6 +264,18 @@ static int nfsrc_getlenandcksum(mbuf_t m
 static void nfsrc_marksametcpconn(u_int64_t);
 /*
+ * Return the correct mutex for this cache entry.
+ */
+static __inline struct mtx *
+nfsrc_cachemutex(struct nfsrvcache *rp)
+{
+
+	if ((rp->rc_flag & RC_UDP) != 0)
+		return (&nfsrc_udpmtx);
+	return (&nfsrc_tcpmtx[nfsrc_hash(rp->rc_xid)]);
+}
+
+/*
  * Initialize the server request cache list
  */
 APPLESTATIC void
@@ -336,10 +350,12 @@ nfsrc_getudp(struct nfsrv_descript *nd,
 	struct sockaddr_in6 *saddr6;
 	struct nfsrvhashhead *hp;
 	int ret = 0;
+	struct mtx *mutex;
 
+	mutex = nfsrc_cachemutex(newrp);
 	hp = NFSRCUDPHASH(newrp->rc_xid);
 loop:
-	NFSLOCKCACHE();
+	mtx_lock(mutex);
 	LIST_FOREACH(rp, hp, rc_hash) {
 		if (newrp->rc_xid == rp->rc_xid &&
 		    newrp->rc_proc == rp->rc_proc &&
@@ -347,8 +363,8 @@ loop:
 		    nfsaddr_match(NETFAMILY(rp), &rp->rc_haddr, nd->nd_nam)) {
 			if ((rp->rc_flag & RC_LOCKED) != 0) {
 				rp->rc_flag |= RC_WANTED;
-				(void)mtx_sleep(rp, NFSCACHEMUTEXPTR,
-				    (PZERO - 1) | PDROP, "nfsrc", 10 * hz);
+				(void)mtx_sleep(rp, mutex, (PZERO - 1) | PDROP,
+				    "nfsrc", 10 * hz);
 				goto loop;
 			}
 			if (rp->rc_flag == 0)
@@ -358,14 +374,14 @@ loop:
 			TAILQ_INSERT_TAIL(&nfsrvudplru, rp, rc_lru);
 			if (rp->rc_flag & RC_INPROG) {
 				newnfsstats.srvcache_inproghits++;
-				NFSUNLOCKCACHE();
+				mtx_unlock(mutex);
 				ret = RC_DROPIT;
 			} else if (rp->rc_flag & RC_REPSTATUS) {
 				/*
 				 * V2 only.
 				 */
 				newnfsstats.srvcache_nonidemdonehits++;
-
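As a standalone illustration, the xid-to-bucket hash factored out by the patch (the nfsrc_hash() macro) can be exercised in userspace. The function below is a hypothetical re-expression of that macro for testing, assuming the shift of 24 bits is there to fold the high-order byte of the XID into the low bits before taking the modulus, so that bursts of nearby XIDs still spread across buckets.

```c
/*
 * Userspace re-expression of the patch's nfsrc_hash() macro
 * (assumption: xid is a 32-bit RPC transaction id).
 */
#include <assert.h>
#include <stdint.h>

#define NFSRVCACHE_HASHSIZE	200

static unsigned int
nfsrc_hash(uint32_t xid)
{
	/* Fold the top byte in, then reduce to a bucket index. */
	return ((xid + (xid >> 24)) % NFSRVCACHE_HASHSIZE);
}
```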
Re: NFS server bottlenecks
On Fri, 12 Oct 2012 22:05:54 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: I've attached the patch drc3.patch (it assumes drc2.patch has already been applied) that replaces the single mutex with one for each hash list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.

I haven't tested this at all, but I think putting all of the mutexes in an array like that is likely to cause cache-line ping-ponging. It may be better to use a pool mutex, or to put the mutexes adjacent in memory to the list heads that they protect. (But I probably won't be able to do the performance testing on any of these for a while. I have a server running the drc2 code but haven't gotten my users to put a load on it yet.)

-GAWollman
Re: NFS server bottlenecks
On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote:

On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote:

Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said:

Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet.

Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates?

Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)

Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrive.

My servers have 96 GB of memory so that's not a big deal for me.

This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-)

I'm not sure I see why doing it as a separate thread will improve things.
There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. 
rick

-GAWollman

My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved.

Just out of curiosity, why do you use 8K reads instead of 64K reads? Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large
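Rick's point about per-RPC overhead is easy to quantify: if the server can sustain a roughly fixed number of RPCs per second, wire throughput scales linearly with rsize. A toy calculation using the ~3000 req/s figure reported above (illustrative numbers only; bytes_per_sec is an invented helper, not anything from the NFS code):

```c
/*
 * Illustration: with per-RPC overhead dominating, throughput is
 * (sustainable RPCs/sec) * rsize. At ~3000 req/s, 8K reads move
 * about 24 MB/s; 64K reads at the same request rate would move
 * eight times that.
 */
#include <assert.h>
#include <stdint.h>

static uint64_t
bytes_per_sec(uint64_t rpcs_per_sec, uint64_t rsize)
{
	return (rpcs_per_sec * rsize);
}
```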
Re: NFS server bottlenecks
On Oct 11, 2012, at 7:20 PM, Nikolay Denev nde...@gmail.com wrote:

On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote:

On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote:

Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said:

Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet.

Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates?

Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)

Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrive.

My servers have 96 GB of memory so that's not a big deal for me.
This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. 
(For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.)

Have fun with it. Let me know when you have what you think is a good patch. rick

-GAWollman

My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved.

Just out of curiosity, why do you use 8K reads instead of 64K reads? Since the RPC overhead (including the DRC functions) is per RPC, doing
Re: NFS server bottlenecks
Nikolay Denev wrote: On Oct 11, 2012, at 7:20 PM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 8:46 AM, Nikolay Denev nde...@gmail.com wrote: On Oct 11, 2012, at 1:09 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: Nikolay Denev wrote: On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. 
This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods). I think the little drc2.patch which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on a i486, although I doubt they wanted to run the nfsd on it.) The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time? With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. 
(For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.) Have fun with it. Let me know when you have what you think is a good patch. rick -GAWollman My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved.
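The drc2.patch idea Rick describes -- nfsd threads skip the global lock and the trim entirely unless the cache has grown past a threshold -- can be modeled in a few lines. This is a hedged sketch: the class name, the highwater mark, and the "trim to half" policy are invented for illustration; the real patch trims the actual LRU/hash lists.

```python
import threading

class TrimGate:
    """Sketch: take the contended global DRC lock for trimming only when an
    unlocked peek at the entry count shows it exceeds a highwater mark."""

    def __init__(self, highwater):
        self.highwater = highwater
        self.count = 0       # cache entry count (approximate is fine for the peek)
        self.lock = threading.Lock()
        self.trims = 0       # how many times we actually trimmed

    def add_entry(self):
        self.count += 1
        # Unlocked peek: on most requests the thread never touches the lock,
        # which is what removes the mutex contention seen in profiling.
        if self.count <= self.highwater:
            return
        with self.lock:
            if self.count > self.highwater:      # re-check under the lock
                self.count = self.highwater // 2  # stand-in for trimming the LRU
                self.trims += 1
```

Under this scheme the lock is held only during the (rare) trim passes, instead of on every RPC, at the cost of letting the cache overshoot the mark briefly between the peek and the trim.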
Re: NFS server bottlenecks
Oops, I didn't get the readahead option description quite right in the last post. The default read ahead is 1, which does result in rsize * 2, since there is the read + 1 readahead. rsize * 16 would actually be for the option readahead=15 and for readahead=16 the calculation would be rsize * 17. However, the example was otherwise ok, I think? rick
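Rick's corrected arithmetic can be written down directly. A small illustrative helper (the function name is invented; the formula follows his description that readahead=N means the read itself plus N readaheads):

```python
def max_outstanding_bytes(rsize, readahead=1):
    """Bytes in flight for a sequential NFS read: the read itself plus
    'readahead' readaheads, i.e. (readahead + 1) * rsize.
    readahead defaults to 1, giving rsize * 2."""
    return (readahead + 1) * rsize
```

So with a 64K rsize, the default gives 128K in flight, readahead=15 gives rsize * 16, and readahead=16 gives rsize * 17, matching the correction above.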
Re: NFS server bottlenecks
Garrett Wollman wrote: On Tue, 9 Oct 2012 20:18:00 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most real NFS servers will be doing disk I/O. We don't always have control over what the user does. I think the worst-case for my users involves a third-party program (that they're not willing to modify) that does line-buffered writes in append mode. This uses nearly all of the CPU on per-RPC overhead (each write is three RPCs: GETATTR, WRITE, COMMIT). Yes. My comment was simply meant to imply that his testing isn't a realistic load for most NFS servers. It was not meant to imply that reducing the CPU overhead/lock contention of the DRC is a useless exercise. rick -GAWollman
Re: NFS server bottlenecks
On Tue, 9 Oct 2012 20:18:00 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most real NFS servers will be doing disk I/O. We don't always have control over what the user does. I think the worst-case for my users involves a third-party program (that they're not willing to modify) that does line-buffered writes in append mode. This uses nearly all of the CPU on per-RPC overhead (each write is three RPCs: GETATTR, WRITE, COMMIT). -GAWollman
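Garrett's worst case can be reproduced in miniature: with line buffering, every line becomes its own small write, and over NFS each such write costs roughly three RPCs (GETATTR, WRITE, COMMIT). A hedged Python sketch, where the counting sink is invented for illustration and the 3-RPCs-per-write figure comes from Garrett's description:

```python
import io

class CountingRaw(io.RawIOBase):
    """Counts write() calls; stands in for the NFS-backed file's raw store."""

    def __init__(self):
        self.writes = 0

    def writable(self):
        return True

    def write(self, b):
        self.writes += 1
        return len(b)

raw = CountingRaw()
# Line-buffered writer, as an append-mode logger would use: every newline
# forces a flush, so each line reaches the file as a separate small write.
f = io.TextIOWrapper(io.BufferedWriter(raw), line_buffering=True)
for i in range(5):
    f.write(f"log line {i}\n")

# Per the worst case above, each of those writes implies roughly
# GETATTR + WRITE + COMMIT on the wire.
rpcs = raw.writes * 3
```

Five log lines thus cost on the order of fifteen RPCs, all fixed per-RPC overhead, which is why this workload burns CPU out of proportion to the bytes written.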
Re: NFS server bottlenecks
On Oct 10, 2012, at 3:18 AM, Rick Macklem rmack...@uoguelph.ca wrote: [...] My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. Just out of curiosity, why do you use 8K reads instead of 64K reads? Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large rsize/wsize values generate too large a burst of traffic for a network interface to handle and then the rsize/wsize has to be decreased to avoid this issue.) And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most real NFS servers will be doing disk I/O.
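Rick's point about per-RPC overhead is easy to quantify: fetching the same amount of data with rsize=8K issues eight times as many READ RPCs as rsize=64K, and the DRC and RPC processing cost scales with the RPC count, not with bytes. A small illustrative calculation (the helper name is invented):

```python
def read_rpcs(file_size, rsize):
    """READ RPCs needed to fetch file_size bytes at a given rsize
    (ceiling division; each RPC pays the same fixed DRC/RPC overhead)."""
    return -(-file_size // rsize)

GIB = 1 << 30  # one gigabyte of reads, as a round example
```

Reading 1 GiB at 8K takes 131072 READ RPCs versus 16384 at 64K, so at the roughly 3000-RPCs/sec ceiling observed above, the 8K run spends 8x longer in per-RPC code for the same data.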
Re: NFS server bottlenecks
On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: [...] My quest for IOPS over NFS continues :) So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more. And this is all from a file that is completely in ARC cache, no disk IO involved. I've snatched some sample DTrace script from the net : [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ] And modified it for our new NFS server :

#!/usr/sbin/dtrace -qs

fbt:kernel:nfsrvd_*:entry
{
    self->ts = timestamp;
    @counts[probefunc] = count();
}

fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
    this->delta = (timestamp - self->ts) / 100;
}

fbt:kernel:nfsrvd_*:return
/ self->ts > 0 && this->delta > 100 /
{
    @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
}
Re: NFS server bottlenecks
On Oct 4, 2012, at 12:36 AM, Rick Macklem rmack...@uoguelph.ca wrote: Garrett Wollman wrote: On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said: Simple: just use a sepatate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet. Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate list. (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.) Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.) Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed. Only doing it once/sec would result in a very large cache when bursts of traffic arrives. My servers have 96 GB of memory so that's not a big deal for me. This code was originally production tested on a server with 1Gbyte, so times have changed a bit;-) I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? Only one cache-trimming thread. 
The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time contending on the mutex (it will be held less frequently and for shorter periods).

I think the little drc2.patch, which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough. When I did production testing on a 1Gbyte server that saw a peak load of about 100 RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recall someone recently trying to run FreeBSD on an i486, although I doubt they wanted to run the nfsd on it.)

The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time?

With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.)

Have fun with it. Let me know when you have what you think is a good patch.
rick

I was doing some NFS testing with a RELENG_9 machine and a Linux RHEL machine over a 10G network, and noticed the same nfsd threads issue. Previously I would read a 32G file locally on the FreeBSD ZFS/NFS server with dd if=/tank/32G.bin of=/dev/null bs=1M to cache it completely in ARC (the machine has 196G RAM); if I then do this again locally I would get close to 4GB/sec read - completely from the cache... But if I try to read the file over NFS from the Linux machine I would only get about 100MB/sec, sometimes a bit more, and all of the nfsd threads are clearly visible in top. pmcstat also showed the same mutex contention as in the original post. I've now applied the drc2 patch, and rerunning the same test yields about 960MB/s transfer over NFS… quite an
Re: NFS server bottlenecks
Garrett Wollman wrote: [Adding freebsd-fs@ to the Cc list, which I neglected the first time around...] On Tue, 2 Oct 2012 08:28:29 -0400 (EDT), Rick Macklem rmack...@uoguelph.ca said:

I can't remember (I am early retired now;-) if I mentioned this patch before: http://people.freebsd.org/~rmacklem/drc.patch It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can be twiddled so that the drc is trimmed less frequently. By making these values larger, the trim will only happen once/sec until the high water mark is reached, instead of on every RPC. The tradeoff is that the DRC will become larger, but given memory sizes these days, that may be fine for you.

It will be a while before I have another server that isn't in production (it's on my deployment plan, but getting the production servers going is taking first priority). The approaches that I was going to look at:

Simplest: only do the cache trim once every N requests (for some reasonable value of N, e.g., 1000). Maybe keep track of the number of entries in each hash bucket and ignore those buckets that only have one entry even if it is stale.

Well, the patch I have does it when it gets too big. This made sense to me, since the cache is trimmed to keep it from getting too large. It also does the trim at least once/sec, so that really stale entries are removed.

Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet.

Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates? A mutex in each element could be used for changes (not insertion/removal) to an individual element.
However, the current code manipulates the lists and makes minimal changes to the individual elements, so I'm not sure if a mutex in each element would be useful or not, but it wouldn't help for the trimming case, imho.

I modified the patch slightly, so it doesn't bother to acquire the mutex when it is checking if it should trim now. I think this results in a slight risk that the test will use an out-of-date cached copy of one of the global vars, but since the code isn't modifying them, I don't think it matters. This modified patch is attached and is also here: http://people.freebsd.org/~rmacklem/drc2.patch

Moderately complicated: figure out if a different synchronization type can safely be used (e.g., rmlock instead of mutex) and do so.

More complicated: move all cache trimming to a separate thread and just have the rest of the code wake it up when the cache is getting too big (or just once a second since that's easy to implement). Maybe just move all cache processing to a separate thread.

Only doing it once/sec would result in a very large cache when bursts of traffic arrive. The above patch does it when it is too big or at least once/sec. I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more cache trimming threads would just increase contention, wouldn't it? The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time?

Isilon did use separate threads (I never saw their code, so I am going by what they told me), but it sounded to me like they were trimming the cache too aggressively to be effective for TCP mounts. (i.e.
It sounded to me like they had broken the algorithm to achieve better perf.) Remember that the DRC is weird, in that it is a cache to improve correctness at the expense of overhead. It never improves performance. On the other hand, turn it off or throw away entries too aggressively and data corruption, due to retries of non-idempotent operations, can be the outcome.

Good luck with whatever you choose, rick

It's pretty clear from the profile that the cache mutex is heavily contended, so anything that reduces the length of time it's held is probably a win. That URL again, for the benefit of people on freebsd-fs who didn't see it on hackers, is: http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.png. (This graph is slightly modified from my previous post as I removed some spurious edges to make the formatting look better. Still looking for a way to get a profile that includes all kernel modules with the kernel.)

-GAWollman
Re: NFS server bottlenecks
Garrett Wollman wrote: I had an email conversation with Rick Macklem about six months ago about NFS server bottlenecks. I'm now in a position to observe my large-scale NFS server under an actual production load, so I thought I would update folks on what it looks like. This is a 9.1 prerelease kernel (I hope 9.1 will be released soon as I have four more of these servers to deploy!). When under nearly 100% load on an 8-core (16-thread) Quanta QSSC-S99Q storage server, with a 10G network interface, pmcstat tells me this:

PMC: [INST_RETIRED.ANY_P] Samples: 2727105 (100.0%), 27 unresolved

%SAMP IMAGE    FUNCTION            CALLERS
 29.3 kernel   _mtx_lock_sleep     nfsrvd_updatecache:10.0 nfsrvd_getcache:7.4 ...
  9.5 kernel   cpu_search_highest  cpu_search_highest:8.1 sched_idletd:1.4
  7.4 zfs.ko   lzjb_decompress     zio_decompress
  4.3 kernel   _mtx_lock_spin      turnstile_trywait:2.2 pmclog_reserve:1.0 ...
  4.0 zfs.ko   fletcher_4_native   zio_checksum_error:3.1 zio_checksum_compute:0.8
  3.6 kernel   cpu_search_lowest   cpu_search_lowest
  3.3 kernel   nfsrc_trimcache     nfsrvd_getcache:1.6 nfsrvd_updatecache:1.6
  2.3 kernel   ipfw_chk            ipfw_check_hook
  2.1 pmcstat  _init
  1.1 kernel   _sx_xunlock
  0.9 kernel   _sx_xlock
  0.9 kernel   spinlock_exit

This does seem to confirm my original impression that the NFS replay cache is quite expensive. Running a gprof(1) analysis on the same PMC data reveals a bit more detail (I've removed some uninteresting parts of the call graph):

I can't remember (I am early retired now;-) if I mentioned this patch before: http://people.freebsd.org/~rmacklem/drc.patch It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can be twiddled so that the drc is trimmed less frequently. By making these values larger, the trim will only happen once/sec until the high water mark is reached, instead of on every RPC. The tradeoff is that the DRC will become larger, but given memory sizes these days, that may be fine for you.
jwd@ was going to test it, but he moved to a different job away from NFS, so the patch has just been collecting dust. If you could test it, that would be nice, rick

ps: Also, the current patch still locks before checking if it needs to do the trim. I think that could safely be changed so that it doesn't lock/unlock when it isn't doing the trim, if that makes a significant difference.

                    called/total      parents
index  %time   self  descendents  called+self      name          index
                    called/total      children

               4881.00  2004642.70   932627/932627     svc_run_internal [2]
[4]     45.1   4881.00  2004642.70   932627            nfssvc_program [4]
              13199.00   504436.33   584319/584319     nfsrvd_updatecache [9]
              23075.00   403396.18   468009/468009     nfsrvd_getcache [14]
               1032.25   416249.44     2239/2284       svc_sendreply_mbuf [15]
               6168.00   381770.44    11618/11618      nfsrvd_dorpc [24]
               3526.87    86869.88   112478/112514     nfsrvd_sentcache [74]
                890.00    50540.89     4252/4252       svc_getcred [101]
              14876.60    32394.26     4177/24500      crfree <cycle 3> [263]
              11550.11    25150.73     3243/24500      free <cycle 3> [102]
               1348.88    15451.66     2716/16831      m_freem [59]
               4066.61      216.81     1434/1456       svc_freereq [321]
               2342.15      677.40      557/1459       malloc_type_freed [265]
                 59.14     1916.84      134/2941       crget [113]
               1602.25        0.00      322/9682       bzero [105]
                690.93        0.00       43/44         getmicrotime [571]
                287.22        7.33      138/1205       prison_free [384]
                233.61        0.00       60/798        PHYS_TO_VM_PAGE [358]
                203.12        0.00       94/230        nfsrv_mallocmget_limit [632]
                151.76        0.00       51/1723       pmap_kextract [309]
                  0.78       70.28        9/3281       _mtx_unlock_sleep [154]
                 19.22       16.88       38/400403     nfsrc_trimcache [26]
                 11.05       21.74        7/197        crsetgroups [532]
                 30.37        0.00       11/6592       critical_enter [190]
                 25.50        0.00        9/36         turnstile_chain_unlock [844]
                 24.86        0.00        3/7          nfsd_errmap [913]
                 12.36        8.57        8/2145       in_cksum_skip [298]
                  9.10        3.59        5/12455      mb_free_ext [140]
                  1.84        4.85        2/2202       VOP_UNLOCK_APV [269]
-----------------------------------------------
                  0.49        0.15        1/1129009    uhub_explore [1581]
                  0.49        0.15        1/1129009    tcp_output [10]
                  0.49        0.15        1/1129009    pmap_remove_all [1141]
                  0.49        0.15        1/1129009    vm_map_insert [236]
                  0.49        0.15        1/1129009    vnode_create_vobject [281]
                  0.49        0.15        1/1129009    biodone [351]
                  0.49        0.15        1/1129009    vm_object_madvise [670]
                  0.49        0.15        1/1129009    xpt_done [483]
                  0.49        0.15        1/1129009    vputx [80]
                  0.49        0.15        1/1129009    vm_map_delete <cycle 3> [49]
                  0.49        0.15        1/1129009    vm_object_deallocate <cycle 3> [356]
                  0.49        0.15        1/1129009    vm_page_unwire [338]
                  0.49        0.15        1/1129009    pmap_change_wiring [318]
                  0.98        0.31        2/1129009    getnewvnode [227]
                  0.98        0.31        2/1129009    pmap_clear_reference [1004]
                  0.98        0.31        2/1129009    usbd_do_request_flags [1282]
                  0.98        0.31        2/1129009    vm_object_collapse <cycle 3> [587]
                  0.98        0.31        2/1129009    vm_object_page_remove [122]
                  1.48        0.46        3/1129009    mpt_pci_intr [487]
                  1.48        0.46        3/1129009    pmap_extract [355]
                  1.48        0.46        3/1129009    vm_fault_unwire [171]
                  1.97        0.62        4/1129009    vgonel [270]
                  1.97        0.62        4/1129009    vm_object_shadow [926]
                  1.97        0.62        4/1129009    zone_alloc_item [434]
                  2.46        0.77        5/1129009    vnlru_free [235]
                  2.46        0.77        5/1129009    insmntque1 [737]
                  2.95        0.93        6/1129009    zone_free_item [409]
                  3.94        1.24