RPC timeouts and other fun with afs 3.5

Derrick J Brashear Mon, 17 Jan 2000 23:59:54 +0100 (MET)
(and for added fun I had to subscribe to send this, since it's apparently
too hard to set up majordomo to allow people to subscribe in a post-only
capacity, and things don't get forwarded from non-subscribers in a timely 
manner. whatever)

Figured I should mention this here in case anyone else is beating their
head against the wall over the same problems.

Since upgrading our AFS DB servers to 3.5 we've had several problems. The
most serious is that our user accounts process, which runs nightly to
check on the status of deleted accounts, started taking 28 hours instead
of 2 or so to run. After some work on the issue it turned out that some VL
RPCs which previously had run almost instantly were now sometimes taking
3 seconds to complete, and the slowdown is worse in certain cases.
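
For anyone trying to find out whether they're being bitten by the same
slowdown, timing each call individually is probably the quickest check.
What follows is only a rough sketch: fake_lookup is a made-up stand-in
for whatever VL lookup your own tools perform (it is not a real AFS
call), and the threshold is arbitrary.

```python
import time

def timed_call(fn, *args, clock=time.perf_counter):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = clock()
    result = fn(*args)
    return result, clock() - start

def find_slow_calls(fn, inputs, threshold=1.0, clock=time.perf_counter):
    """Return [(input, elapsed)] for each call slower than threshold seconds."""
    slow = []
    for item in inputs:
        _, elapsed = timed_call(fn, item, clock=clock)
        if elapsed > threshold:
            slow.append((item, elapsed))
    return slow

def fake_lookup(name):
    # Hypothetical stand-in for a volume location lookup; not a real AFS call.
    if name == "slow.vol":
        time.sleep(0.05)  # pretend this one got stuck behind something
    return name.upper()

if __name__ == "__main__":
    for name, secs in find_slow_calls(fake_lookup, ["user.a", "slow.vol"], 0.02):
        print(f"{name}: {secs:.3f}s")
```

Swap fake_lookup for the real call and raise the threshold to something
like a second, and the calls that used to be instant should stand out.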

What's happening appears to be that a change to how RX server listeners
are dealt with has several side effects. One, affecting this issue, is
that the listener can end up servicing requests itself, and it typically
runs at a higher priority than other threads. So, if it services a
long-running RPC, like one listing all the volumes in your cell, other
RPCs get starved. A similar problem apparently shows up in the volserver.
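
To get a feel for why one long call hurts so much, here is a toy model.
It is not how Rx is implemented; it just assumes requests are served in
arrival order by some number of workers, and treats "the listener
services the call itself and nothing else runs" as the one-worker case.
The request times are invented for illustration.

```python
import heapq

def latencies(requests, workers):
    """Latency (finish - arrival) of each request when `workers` threads
    serve them in arrival order. requests is a list of (arrival, service)."""
    free_at = [0.0] * workers          # when each worker next becomes free
    heapq.heapify(free_at)
    result = []
    for arrival, service in requests:
        start = max(arrival, heapq.heappop(free_at))  # wait for a free worker
        finish = start + service
        heapq.heappush(free_at, finish)
        result.append(finish - arrival)
    return result

# One long "list every volume" call followed by three quick lookups
# (made-up numbers).
requests = [(0.0, 3.0), (0.1, 0.01), (0.2, 0.01), (0.3, 0.01)]

inline = latencies(requests, workers=1)  # long call blocks everything else
pooled = latencies(requests, workers=2)  # long call handed to its own worker

if __name__ == "__main__":
    print("inline:", [round(x, 2) for x in inline])
    print("pooled:", [round(x, 2) for x in pooled])
```

In the one-worker case every quick lookup waits close to three seconds
behind the long call; with a second worker they finish in about the time
they actually need.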

Another is that the thread stack size of service threads doesn't apply to
the listener, meaning if you have an RPC being serviced by the listener
which requires lots of stack, it may run out and lose. In our case the big
loss here is the adm server, which, while not a Transarc product, is
linked against the rx/rxkad/lwp they provide.

So this is a double-whammy for us.

Reports have been submitted, one less than an hour ago, so hopefully
something will happen. The thing that bugs me is that for a long time I
had been pressing for inclusion of a fix for something else (the "new
kaserver interrealm key creation results in keys not usable through the
kaserver udp interface" problem). I was told that it needed to be tested,
that this was why I hadn't seen it, and that it would take a while. Well,
you'd think testing would have caught this, if there's so much testing to
be done in order to make any change of consequence.

On the other hand, Transarc has been much quicker to implement fixes
lately for things we've submitted, so I probably should be less grumpy...
but that's so out of character for me;-)

I'm going back to try to patch things up here so we can get back in
business.

-D