On Wed, 9 Nov 2011, Andrew Deason wrote:

> On Wed, 9 Nov 2011 14:38:10 -0500 (EST)
> Kevin Hildebrand <[email protected]> wrote:

>> What I'm observing is that as soon as the vos release begins, one or
>> more of the readonly replicas start accumulating connections in the
>> 'error' state.

> Does the connection itself have an error, or individual calls? I'm
> assuming you are seeing this via rxdebug; do you see
>
>     Connection from host x.x.x.x, port x, Cuid x, error x
>
> or do you see
>
>     call x: # x, state active, mode: error
>
> Or better yet, just give specifically what you see :)


For example:

Connection from host 129.2.56.137, port 7001, Cuid a59e5fd1/37e0f99c
  serial 266,  natMTU 1444, security index 0, server conn
    call 0: # 220, state active, mode: error
    call 1: # 21, state dally, mode: eof, flags: receive_done
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
Connection from host 128.8.163.75, port 7001, Cuid 96a0fa27/38de8e1c
  serial 86,  natMTU 1444, security index 0, server conn
    call 0: # 26, state active, mode: error
    call 1: # 23, state not initialized
    call 2: # 0, state not initialized
    call 3: # 0, state not initialized
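As a quick way to watch this condition build up, output like the above can be filtered with standard tools. A minimal sketch, assuming the rxdebug output has been captured to stdin (in practice something like `rxdebug <fileserver> 7000 -allconnections`; the host is a placeholder):

```shell
# Count calls stuck in 'mode: error' across all server connections.
# Here a pasted sample (taken from the rxdebug output above) stands in
# for the live `rxdebug <fileserver> 7000 -allconnections` output.
sample='    call 0: # 220, state active, mode: error
    call 1: # 21, state dally, mode: eof, flags: receive_done
    call 0: # 26, state active, mode: error
    call 1: # 23, state not initialized'
# grep -c counts the matching lines, one per stuck call.
printf '%s\n' "$sample" | grep -c 'mode: error'
# prints 2
```

Running this periodically against the fileserver gives a rough gauge of how fast the stuck calls are accumulating.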

>> FileLog shows incoming FetchStatus RPCs to that replica are not being
>> answered.  If this condition occurs long enough, all of these
>> connections eventually fill up the thread pool and the fileserver
>> stops serving data to everything else.
>>
>> At some point, up to five minutes later, as the release proceeds, the
>> replica in question gets marked offline by the release process.  At
>> this time, all of the stuck RPCs get 'FetchStatus returns 106'
>> (VOFFLINE), at which point the connection pool clears, and life on the
>> fileserver returns to normal.

> There is a known situation in which a client can hold a reference to
> the volume for a long period of time, which prevents the volume from
> going offline and causes some responses to hang and build up. But there
> are some related fixes for it; what versions are in play here?


1.4.14, clients and servers.

>> What I can't figure out is what's going on during the time the RPCs
>> are hung, and why the connections show 'error'.  (How does one
>> determine what the error condition is, when viewing rxdebug output?)
>> Why would an RO replica be hung during a vos release?

> You can see where the threads are hanging by getting a backtrace of all
> of the threads. You can run 'pstack <fileserver pid>' to get this, or
> generate a core and examine it with a debugger. If you're on Linux, run
> 'gcore <fileserver pid>' and 'gdb <fileserver binary> <core>', then
> do something like:
>
> (gdb) set height 0
> (gdb) set width 0
> (gdb) set logging file /tmp/some/file
> (gdb) set logging on
> (gdb) thread apply all bt
> (gdb) quit
>
> And put that output up somewhere. There might be a little sensitive
> information in that (filenames would be the most likely thing), but you
> should be able to tell whether or not you care by just looking at it.
> If the issue I mention above is relevant, if I recall correctly you'll
> see several threads inside VGetVolume_r or similar, one of which will
> be inside VOffline_r.
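The interactive gdb steps above can also be driven non-interactively via gdb's batch mode; a sketch, assuming the commands are saved to a file (the file name is a placeholder, and the logging path is the same stand-in used above):

```
# backtrace.gdb -- the same commands as the interactive session above
set height 0
set width 0
set logging file /tmp/some/file
set logging on
thread apply all bt
```

Run it with `gdb -batch -x backtrace.gdb <fileserver binary> <core>`, or attach to the live process with `gdb -batch -x backtrace.gdb -p <fileserver pid>`.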


See /afs/glue.umd.edu/home/glue/k/e/kevin/pub/afs_debug/stacktrace. You are correct: most threads are in VGetVolume_r or VOffline_r.

And regarding Derrick's request for the timed vos release: if it's still needed, I'll tackle that tomorrow morning.

Thanks,
Kevin


> --
> Andrew Deason
> [email protected]

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
