Jorgen Lundman wrote:
> 
> Hello again,
> 
> 
> Sorry for the delay in reply, spent the weekend monitoring and
> troubleshooting the box. Ever since the panic/reboot, its performance has
> been stellar, so I think we can rule out the "new hardware" angle.

So I wrote that both in the hope that it was true and to defy fate. As it
happens, fate did not take long to get back at me.

So last night we had another load spike, starting at 2017-11-27 22:27:39
(the readings below are 3 minutes apart):

CRITICAL - load average: 80.34, 46.06, 22.04
CRITICAL - load average: 35.05, 46.92, 27.12
CRITICAL - load average: 8.64, 29.14, 23.55
CRITICAL - load average: 11.88, 20.95, 21.21
WARNING - load average: 5.79, 13.97, 18.36

Naturally, the CGI/NFS clients reacted. There is always a background rate of
NFS errors, but it spiked as well:

# grep '^Nov 27 20:' messages|grep NFS4ERR| wc -l
       6
# grep '^Nov 27 21:' messages|grep NFS4ERR| wc -l
       9
# grep '^Nov 27 22:' messages|grep NFS4ERR| wc -l
     111
# grep '^Nov 27 23:' messages|grep NFS4ERR| wc -l
      45
# grep '^Nov 28 00:' messages|grep NFS4ERR| wc -l
      33
# grep '^Nov 28 01:' messages|grep NFS4ERR| wc -l
      13
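
For reference, the same hourly breakdown in one pass (this assumes the
standard syslog timestamp layout in the messages file):

# grep NFS4ERR messages | awk '{ print $1, $2, substr($3, 1, 2) }' | sort | uniq -c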

On the clients, these are constant:

Nov 27 22:15:57 cgi05 nfs: [ID 286389 kern.info] NOTICE: [NFS4][Server:
nfs02-cgi][Mntpt: /export/www]File
./customer-path-hidden/wordpress-4.4.1-ja-jetpack-undernavicontrol/wp-content/cache/db/000000/options/534/421
(rnode_pt: ffffffffd4b1d840) was closed due to NFS recovery error on server
nfs02-cgi(failed to recover from NFS4ERR_STALE NFS4ERR_STALE)

On the clients, I'm not sure I've seen these before:

Nov 27 22:35:49 cgi05 nfs: [ID 435015 kern.info] NOTICE: [NFS4][Server:
nfs02-cgi][Mntpt: /export/www]Operation open for file
./customer-path-hidden/index.php (rnode_pt 0xffffffffdb963bb0), pid 0 using
seqid 1 got NFS4ERR_BAD_SEQID.  Last good seqid was 0 for operation .



This is currently the only NFS server that has these spikes, and there are no
messages anywhere on that NFS server itself.



> Jorgen, could you decrease number of nfs server threads to 256 and check 
> behaviour again ?


The graphs for NFS threads have been flat at about ~250 since the reboot 5
days ago, then dropped just after 22:00 (coinciding with the trouble) to
around ~150, where they currently sit. The drop itself is unusual; the other
servers do not drop.

Even though we set the NFS threads max to 1024, we find the thread count to be
quite self-regulating, since it is "mostly" driven by the number of NFS
clients mounting. The other NFS servers (the ones not having spikes) are also
flat at around ~235 (looking at monthly graphs).
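
If we do end up capping it at 256 as suggested, I assume this is the knob on
this box (and that a restart of nfs/server is needed for it to take effect):

# sharectl get -p servers nfs                      <- currently servers=1024 here
# sharectl set -p servers=256 nfs
# svcadm restart svc:/network/nfs/server:default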


> It seems like the problem is at id_alloc() which uses the vmem framework to 
> allocate unique ids.
> In particular, vmem_nextfit_alloc() is the one that is responsible for your 
> slowness as its operation is single threaded.
> I’m somewhat confused by its implementation but my hunch is that it doesn’t 
> scale well to 48 CPUs.

The system itself:
Memory: 384G phys mem, 17G free mem, 4096M total swap, 4096M free swap

But that does not give any insight into the vmem arena backing space_id_t, so
I'm unsure what I can do there. If I had to guess, limiting the ARC would not
matter, since we are talking about a specific arena.

Can I boot with half the CPUs disabled? We would rather do that on the live
system, and get back to debugging the problem on test hardware, if at all
possible.
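
If offlining processors at runtime is acceptable, I believe psradm can do it
without a reboot; the cut-off at ID 24 below is just my guess at the upper
half of a 48-CPU box:

# psrinfo                                            <- list processor IDs and their state
# psradm -f $(psrinfo | awk '$1 >= 24 { print $1 }') <- take the upper half offline
# psradm -n $(psrinfo | awk '$1 >= 24 { print $1 }') <- bring them back online afterwards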

> It would be interesting to see what the vmem arena backing that space_id_t 
> resource looks like.

Yes, I agree - tell me how to do that :)
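
My best guess so far is something along these lines with mdb -k, though I
don't know which arena name corresponds to the space_id_t resource:

# echo '::vmem' | mdb -k              <- list all vmem arenas (in-use/total/alloc/fail)
# echo '::kmastat' | mdb -k           <- kmem cache and vmem arena statistics
# echo '<arena-addr>::walk vmem_seg | ::vmem_seg' | mdb -k   <- segments of one arena

Please correct me if there is a better way.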



> Please look at fmdump output. It should show stacks even if you didn't save a 
> crash dump.

Alas, fmdump only shows the filesystem/local alert, and my own correction of
it.
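
That is only the default fault log, though; if the error log is worth checking
too, I believe these are the incantations:

# fmdump              <- fault log (what I looked at)
# fmdump -e           <- error log (ereports)
# fmdump -eV          <- error log with full verbose payloads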







-- 
Jorgen Lundman       | <[email protected]>
Unix Administrator   | +81 (0)90-5578-8500
Shibuya-ku, Tokyo    | Japan

