Hello experts,

We have quite a number of NFS servers running OmniOS, but the very latest
hardware is giving us some grief and I was hoping to get some assistance in
finding out why.


SunOS nfs02 5.11 omnios-b5093df i86pc i386 i86pc
  OmniOS v11 r151016

MB : MBD-X10DRH-iT  (Xeon Supermicro)
CPU: E5-2650V4 2.2GHz x 2 (48 cores)
Mem: Intel Xeon BDW-EP DDR4-2400 ECC REG 32GB x 12 (384G)

ZFS pool of 24 HDDs, serving NFSv4 clients.

This is the first server of this hardware type, and the first we are
experiencing troubles with. The older servers are generally 32 core.




In normal situations, the load is higher than expected (at least compared
to the load on the Solaris 10 servers we are replacing). But possibly it
is just that the loadavg math has changed.

last pid:  3198;  load avg:  5.83,  7.31,  8.29;  up 0+00:40:39        15:10:30
63 processes: 62 sleeping, 1 on cpu
CPU states: 89.3% idle,  0.0% user, 10.7% kernel,  0.0% iowait,  0.0% swap
Kernel: 97979 ctxsw, 20 trap, 96827 intr, 208 syscall
Memory: 384G phys mem, 337G free mem, 4096M total swap, 4096M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
   935 daemon    666  60  -20 9544K 8740K sleep  221:20  7.52% nfsd
   933 root       15  59    0 5328K 3476K sleep    0:02  0.00% mountd
  1666 root        1  59    0   73M   71M cpu/24   0:02  0.00% top
   318 root       34  59    0 8812K 5448K sleep    0:01  0.00% nscd

(Recently rebooted after a panic. Alas, there is no information on why in
the logs or in a dump, which looks especially evil at the moment.)


Then from time to time it goes crazy: the load goes over 50 and the nfsd
threads drop to about 120. All NFS clients spew messages regarding
NR_BAD_SEQID and NFS4ERR_STALE.
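
A minimal sketch, in Python, of how such episodes could be flagged
automatically from saved top output like the snapshot above. The function
names and both thresholds are assumptions picked for illustration from the
symptoms described here (load over 50, nfsd threads falling from 666 to
about 120); nothing like this is currently running on nfs02.

```python
import re

# Assumed thresholds, loosely based on the symptoms described in this post.
LOAD_LIMIT = 50.0        # 1-minute load average considered "crazy"
MIN_NFSD_THREADS = 200   # nfsd NLWP below this is treated as a thread drop

LOAD_RE = re.compile(r"load avg:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_top_snapshot(text):
    """Return (1-minute load average, nfsd NLWP) from one top snapshot.

    Either value is None if the corresponding line is not found.
    """
    load = None
    nlwp = None
    for line in text.splitlines():
        match = LOAD_RE.search(line)
        if match:
            load = float(match.group(1))
            continue
        fields = line.split()
        # Process rows look like:
        # PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
        if len(fields) >= 11 and fields[-1] == "nfsd" and fields[2].isdigit():
            nlwp = int(fields[2])
    return load, nlwp


def is_degraded(load, nlwp):
    """True if the load is too high or nfsd has too few threads."""
    too_loaded = load is not None and load > LOAD_LIMIT
    too_few_threads = nlwp is not None and nlwp < MIN_NFSD_THREADS
    return too_loaded or too_few_threads
```

Feeding each saved snapshot through parse_top_snapshot() and then
is_degraded() would give a timeline of the bad periods to line up against
the client-side errors.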

Sometimes it recovers, sometimes it reboots. Crash dumps have been enabled
now, in case it crashes again.

During idle time, flamegraph stacks are mostly in unix`acpi_cpu_cstate and
i86_mwait.

( flamegraph here: http://www.lundman.net/nfs02-idle.svg )

During the last spike to load 50, the flamegraph showed it to be busy in
rfs4_findstate_by_owner_file > rfs4_dbsearch > vmem_nextfit_alloc.

( flamegraph here: http://www.lundman.net/nfs02-busy.svg )

Although, considering how much memory is free (337G), should it be blocking
there?

I've been trying to find anything of interest on the server, but I'm unsure
what is going on. I have gone through many of the DTraceToolkit tools as
well. Please ask for any output you would like to see!

r151016 is a bit old, especially on the newest hardware, but going through
the illumos commit log, the only thing I found in the nfs area was "7912
nfs_rwlock readers are running wild waiting".


Cheers,

Lund


-- 
Jorgen Lundman       | <[email protected]>
Unix Administrator   | +81 (0)90-5578-8500
Shibuya-ku, Tokyo    | Japan


------------------------------------------
illumos-discuss
Archives: 
https://illumos.topicbox.com/groups/discuss/discussions/T1f149f6156a80f52-M8ee746805dc1044096166714