----- Original Message ----- From: "Rick Macklem" <[email protected]>
To: "Michael Tratz" <[email protected]>
Cc: <[email protected]>
Sent: Thursday, July 25, 2013 1:25 AM
Subject: Re: NFS deadlock on 9.2-Beta1


Michael Tratz wrote:
Two machines (NFS Server: running ZFS / Client: disk-less), both are
running FreeBSD r253506. The NFS client starts to deadlock processes
within a few hours. It usually gets worse from there on. The
processes stay in "D" state. I haven't been able to reproduce it
when I want it to happen. I only have to wait a few hours until the
deadlocks occur when traffic to the client machine starts to pick
up. The only way to fix the deadlocks is to reboot the client. Even
an ls to the path which is deadlocked, will deadlock ls itself. It's
totally random what part of the file system gets deadlocked. The NFS
server itself has no problem at all to access the files/path when
something is deadlocked on the client.

Last night I decided to put an older kernel on the system r252025
(June 20th). The NFS server stayed untouched. So far 0 deadlocks on
the client machine (it should have deadlocked by now). FreeBSD is
working hard like it always does. :-) There are a few changes to the
NFS code from the revision which seems to work until Beta1. I
haven't tried to narrow down whether one of those commits is causing
the problem. Maybe someone has an idea what could be wrong and I can
test a patch or if it's something else, because I'm not a kernel
expert. :-)

Well, the only NFS client change committed between r252025 and r253506
is r253124. It fixes a file corruption problem caused by a previous
commit that delayed the vnode_pager_setsize() call until after the
nfs node mutex lock was unlocked.

If you can test with only r253124 reverted to see if that gets rid of
the hangs, it would be useful, although from the procstats, I doubt it.
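If it helps, backing out just that one commit might look something like this on an svn checkout of the source tree (the /usr/src path and GENERIC kernel config name are assumptions, adjust for your setup):

```sh
# Sketch: reverse-merge only r253124 into a source checkout,
# then rebuild and install the kernel.
cd /usr/src
svn merge -c -253124 .              # reverse-merge the single revision
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now
```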

I have run several procstat -kk on the processes including the ls
which deadlocked. You can see them here:

http://pastebin.com/1RPnFT6r

All the processes you show seem to be stuck waiting for a vnode lock
or in __umtx_op_wait. (I'm not sure what the latter means.)

What is missing is what processes are holding the vnode locks and
what they are stuck on.

A starting point might be "ps axhl", to see what all the threads
are doing (particularly the WCHAN for them all). If you can drop into
the debugger when the NFS mounts are hung and do a "show alllocks",
that could help. See:
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
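To put the above together, the session on the hung client might look roughly like this (PID 1234 is a placeholder for one of the stuck "D"-state processes):

```sh
ps axhl                    # note the WCHAN column for every thread
procstat -kk 1234          # kernel stack of a stuck process
# Break into the kernel debugger (Ctrl+Alt+Esc on the console, or
# "sysctl debug.kdb.enter=1"), then at the db> prompt:
#   show alllocks
#   show lockchain 1234
```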

I'll admit I'd be surprised if r253124 caused this, but who knows.

If there have been changes to your network device driver between
r252025 and r253506, I'd try reverting those. (If an RPC gets stuck
waiting for a reply while holding a vnode lock, that would do it.)

Good luck with it and maybe someone else can think of a commit
between r252025 and r253506 that could cause vnode locking or network
problems.

You could break to the debugger when it happens and run:
show sleepchain
and
show lockchain
to see what's waiting on what.

   Regards
   Steve

