I was eventually able to find the bad client. For some reason, one particular machine was taking almost 10 minutes to read a 2 MB file out of the volume in question, even though the network between the fileserver and that machine seemed to be okay. I ended up rebooting the client, and now it seems happy again.

In case anyone else ends up in this situation: I found the bad client from some rxdebug output I had saved during a hang. The only client still doing anything looked like this:

Connection from host 129.2.163.45, port 7001, Cuid 9e59be0c/3ae9a2bc
  serial 3466,  natMTU 1444, security index 0, server conn
    call 0: # 78, state dally, mode: eof, flags: receive_done
    call 1: # 233, state active, mode: sending, flags: window_send receive_done, has_output_packets
    call 2: # 124, state dally, mode: eof, flags: receive_done
    call 3: # 0, state not initialized
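
For anyone who wants to automate that search, here is a minimal sketch that scans saved rxdebug output for calls in state "active". The function name and regular expressions are my own and only assume the line layout shown above; other rxdebug versions may print things differently.

```python
import re

# Patterns assume the rxdebug layout pasted above; adjust if yours differs.
CONN_RE = re.compile(r"Connection from host ([\d.]+), port (\d+)")
CALL_RE = re.compile(r"call (\d+): # (\d+), state (\w+)(?:, mode: (\w+))?")


def active_calls(rxdebug_text):
    """Return (host, port, call_slot, call_number, mode) for every active call."""
    results = []
    host = port = None
    for line in rxdebug_text.splitlines():
        conn = CONN_RE.search(line)
        if conn:
            # Remember which connection the following call lines belong to.
            host, port = conn.group(1), int(conn.group(2))
            continue
        call = CALL_RE.search(line)
        if call and call.group(3) == "active":
            results.append(
                (host, port, int(call.group(1)), int(call.group(2)), call.group(4))
            )
    return results
```

Feeding it the connection above should single out call 1 from 129.2.163.45 in mode "sending". A call that stays in that state across several rxdebug snapshots is the one worth chasing.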

It was also the last entry in the log before the volume went offline:

Wed Nov  9 11:32:22 2011 [33] SRXAFS_FetchData, Fid = 1970897351.3208.3134057
Wed Nov  9 11:32:22 2011 [33] SRXAFS_FetchData, Fid = 1970897351.3208.3134057, Host 129.2.163.45:7001, Id 32766
Wed Nov  9 11:32:22 2011 [33] FetchData_RXStyle: Pos 0, Len 1048576
Wed Nov  9 11:32:22 2011 [33] FetchData_RXStyle: file size 3600423
...
Wed Nov  9 11:37:31 2011 [33] VOffline: Volume 1970897351 (s.common.readonly) is now offline
Wed Nov  9 11:37:31 2011 [33]  (A volume utility is running.)
Wed Nov  9 11:37:31 2011 [33]
Wed Nov  9 11:37:31 2011 [33] SRXAFS_FetchData returns 0

...followed by all of the hung clients getting freed up with VOFFLINE (106) errors:

Wed Nov  9 11:37:31 2011 [6] SAFS_FetchStatus returns 106
Wed Nov  9 11:37:31 2011 [9] SAFS_FetchStatus returns 106
Wed Nov  9 11:37:31 2011 [96] SAFS_FetchStatus returns 106
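
The same check can be done against the log. This is a rough sketch, not anything that ships with OpenAFS: it assumes the fileserver is logging at a level that records FetchData calls, that each entry is on one line in the format shown above, and that a thread handles one call at a time. The names are made up for illustration.

```python
import re

# "<timestamp> [<thread id>] <message>", as in the log excerpts above.
LINE_RE = re.compile(r"^(.+?) \[(\d+)\] (.*)$")
START_RE = re.compile(r"SRXAFS_FetchData, Fid = ([\d.]+), Host ([\d.]+):(\d+)")
END_RE = re.compile(r"SRXAFS_FetchData returns")


def fetches_open_at_offline(lines):
    """Return (host, fid, start_time, offline_time) for each FetchData call
    that had started but not yet returned when a VOffline entry was logged."""
    open_calls = {}  # thread id -> (start timestamp, fid, host)
    found = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue
        ts, tid, msg = m.groups()
        start = START_RE.search(msg)
        if start:
            open_calls[tid] = (ts, start.group(1), start.group(2))
        elif END_RE.search(msg):
            open_calls.pop(tid, None)
        elif msg.startswith("VOffline:"):
            for started, fid, host in open_calls.values():
                found.append((host, fid, started, ts))
    return found
```

Run over the excerpt above, this should report the fetch from 129.2.163.45 that started at 11:32:22 and was still outstanding when the volume went offline at 11:37:31.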

Kevin




On Wed, 9 Nov 2011, Kevin Hildebrand wrote:


Excellent.  I'm glad that there's a known cause and a fix to boot.  This
is a big help.  I'll track down the recalcitrant client anyway for the
sake of completeness, and that will likely speed up the vos release.
Though you are correct, I don't really care how long the release takes, as
long as it's not blocking clients from accessing data.

Thanks a bunch for your help!

Kevin

On Wed, 9 Nov 2011, Andrew Deason wrote:

On Wed, 9 Nov 2011 17:49:51 -0500 (EST)
Kevin Hildebrand wrote:

For example:

Connection from host 129.2.56.137, port 7001, Cuid a59e5fd1/37e0f99c
   serial 266,  natMTU 1444, security index 0, server conn
     call 0: # 220, state active, mode: error

Okay. I don't think we expose the error code anywhere over the wire for
rxdebug. You can get the error code either by looking at a core of the
fileserver process and looking in the rx call structures, or by looking
at a packet trace at around the time this happens (you should see an rx
abort packet go by, which will have the abort code in it).
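
As a rough illustration of the packet-trace route, the sketch below pulls the abort code out of the UDP payload of a single rx packet. It relies on the standard rx wire layout (a 28-byte header with the packet type in byte 20, and abort packets carrying the code as a 32-bit network-order integer right after the header). The function name is hypothetical, and getting the payload bytes out of a capture is left to whatever tool you use.

```python
import struct

RX_HEADER_SIZE = 28       # fixed-size rx header
RX_TYPE_OFFSET = 20       # packet type byte follows epoch, cid, call number, seq, serial
RX_PACKET_TYPE_ABORT = 4


def rx_abort_code(udp_payload):
    """Return the abort code from an rx abort packet, or None if the
    payload is too short or is not an abort packet."""
    if len(udp_payload) < RX_HEADER_SIZE + 4:
        return None
    if udp_payload[RX_TYPE_OFFSET] != RX_PACKET_TYPE_ABORT:
        return None
    (code,) = struct.unpack_from("!i", udp_payload, RX_HEADER_SIZE)
    return code
```

A code of 106 here would correspond to VOFFLINE.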

See /afs/glue.umd.edu/home/glue/k/e/kevin/pub/afs_debug/stacktrace.  You
are correct, most threads are in VGetVolume_r or VOffline_r.

Yes, this appears to be the issue I was talking about. The client
issuing the FetchData64 call in Thread 191/80 is holding a reference to
the volume, and is (presumably) not consuming the data very quickly. The
release will not continue and the other clients will not be able to be
serviced until it finishes that FetchData64 call. Identifying which
client it is requires examining a fileserver core. (Or you might be able
to deduce it from network traffic, if you really wanted to.)

Anyway, you want this patch:
<http://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=2ad34a27105e591f40652e1a454ea7dc458686a1>
That will not make the release go any faster, but it will prevent the
other calls that occur at the same time from hanging. Instead, they
should fail over to other available RO sites.

If you want the release to go faster, there is a set of patches that
allows you to specify a timeout for this situation, after which the
problematic client will get kicked off so the release can proceed. Those
changes are a bit more involved, though; I wouldn't bother with them
unless the release delays are a problem for you.

--
Andrew Deason

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info

