(And for added fun I had to subscribe to send this, since it's apparently too hard to set up majordomo to allow people to subscribe in a post-only capacity, and things don't get forwarded from non-subscribers in a timely manner. Whatever.)

Figured I should mention this here in case anyone else is beating their head against the wall because of it.

Since upgrading our AFS DB servers to 3.5 we've had several problems. The most serious is that our user accounts process, which runs nightly to check on the status of deleted accounts, started taking 28 hours instead of 2 or so. After some work on the issue it turned out that some VL RPCs which previously completed almost instantly were now taking 3 seconds in some cases. The problem is worse in certain cases.

What appears to be happening is that a change to how RX server listeners are handled has several side effects. One, which affects this issue, is that the listener can end up servicing requests itself, and it typically runs at a higher priority than other threads. So if it services a long-running RPC, like one listing all the volumes in your cell, other RPCs get starved. A similar problem apparently shows up in the volserver.

Another is that the thread stack size of service threads doesn't apply to the listener, meaning that if the listener services an RPC which requires lots of stack, it may lose. In our case the big loss here is the adm server, which while not a Transarc product is linked against the rx/rxkad/lwp they provide. So this is a double whammy for us.

Reports have been submitted, one less than an hour ago, so hopefully something will happen.
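For anyone who wants a feel for why a listener servicing a long RPC hurts so much, here is a toy queueing sketch. It is not RX code and makes no claim about how the 3.5 listener is actually implemented; the function name `latencies`, the 4-thread pool, and the request timings are all made up for illustration. It only compares "one thread serves everything in arrival order" against "requests are handed to a pool of service threads".

```python
import heapq

def latencies(requests, workers, listener_serves):
    """Return per-request latency (finish - arrival) in milliseconds.

    requests: list of (arrival_ms, service_ms), sorted by arrival time.
    workers: number of service threads in the hypothetical pool.
    listener_serves: if True, model a listener that runs every request
        itself and picks up nothing new until it finishes (assumption:
        a deliberately simplified stand-in for the starvation case).
    """
    servers = 1 if listener_serves else workers
    free_at = [0] * servers          # min-heap: when each server is next idle
    heapq.heapify(free_at)
    result = []
    for arrival, service in requests:
        # The request starts once it has arrived and a server is idle.
        start = max(arrival, heapq.heappop(free_at))
        finish = start + service
        heapq.heappush(free_at, finish)
        result.append(finish - arrival)
    return result

# One 3-second "list everything" call followed by two 10 ms lookups.
requests = [(0, 3000), (100, 10), (200, 10)]

print(latencies(requests, workers=4, listener_serves=True))   # [3000, 2910, 2820]
print(latencies(requests, workers=4, listener_serves=False))  # [3000, 10, 10]
```

In the first case the two quick lookups each wait nearly 3 seconds behind the long call, which is the same shape as what we saw with the VL RPCs. In the second they finish in 10 ms because an idle service thread picks them up.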
The thing that bugs me is that for a long time I had a fix for something else I'd been pressing for inclusion (the "new kaserver interrealm key creation results in keys not usable through the kaserver UDP interface" problem), and was told that it needed to be tested, which is why I hadn't seen it and why it would take a while. Well, you'd think testing would have caught this if there's so much testing to be done before making any change of consequence.

On the other hand, Transarc has been much quicker lately to implement fixes for things we've submitted, so I probably should be less grumpy... but that's so out of character for me ;-)

I'm going back to try to patch things up here so we can get back in business.

-D
