I've been having problems lately with runaway processes continually
doing fetches from a fileserver, hundreds of times per second.  Net
traffic typically shows a bunch of fetchstatus/fetchdata ops from
client to server, with rx "abort" packets being sent back from the
server to client.  The apparent cause is expired user tokens and
processes still trying to access files in protected space.

Has anyone else seen this?  What fix did you come up with?  Here's an
example (with 5.38 server and client) of the behavior I'm talking
about:

  > cd
  > mkdir foo
  > touch foo/bar
  > fs sa foo system:anyuser l -clear
  > cd foo
  > fs flushv bar
  > unlog
  > perl -e 'while (1) { stat "bar"; sleep 1 }'

At this point, the AFS traffic to the fileserver looks like:

  17:46:35.098717 [c->s] fetchstatus 536939212..181142514 (DF)  (vol/inode of
  17:46:35.099141 [s->c] rx abort (DF)                            "foo/bar")
  17:46:36.092048 [c->s] fetchstatus 536939212..181142514 (DF)
  17:46:36.092524 [s->c] rx abort (DF)
  17:46:37.092415 [c->s] fetchstatus 536939212..181142514 (DF)
  17:46:37.092543 [s->c] rx abort (DF)
  17:46:38.092045 [c->s] fetchstatus 536939212..181142514 (DF)
  17:46:38.092509 [s->c] rx abort (DF)

once per second, every time the stat is done.  Get the file back in
the cache again ("ls bar" from another window and PAG), and the
traffic goes away.
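
For anyone comparing notes, here is a rough sketch of how a saved trace
could be tallied.  It assumes the trace lines look exactly like the
ones above (timestamp, direction, operation); the function name
summarize_trace and the file name trace.txt are made up for
illustration and are not part of any AFS tool.

```shell
# Hypothetical helper: summarize a trace in the format shown above.
# Reads trace lines on stdin and prints the number of client-to-server
# calls per operation, plus the number of rx aborts sent back.
summarize_trace() {
  awk '
    # Field 2 is the direction, field 3 the operation name.
    $2 == "[c->s]" { calls[$3]++ }
    # Server replies of the form "[s->c] rx abort".
    $2 == "[s->c]" && $3 == "rx" && $4 == "abort" { aborts++ }
    END {
      # Note: awk does not guarantee the order of this loop.
      for (op in calls) printf "%s calls: %d\n", op, calls[op]
      printf "rx aborts: %d\n", aborts + 0
    }
  '
}
```

Run as "summarize_trace < trace.txt".  If every call is matched by an
abort, as in the excerpt above, the two counts come out equal.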

I did some more experimenting with various directory ACLs and token
status, and sometimes the stat() caused net traffic and sometimes it
didn't.  Sometimes it looked like a directory read (fetchdata slice at
0, size 999999999).  I didn't play with it long enough to develop a
theory, but the inconsistency does seem to point to a possible bug in
the cache manager.  (I keep vacillating between "the cache should
cache the info" and "the cache needs to hit the server each time to
see if things have changed".)

This morning, I had a fileserver getting 150k RPC calls every 120
seconds (about 1,250 per second) and readonly volumes with 65M
accesses because of this, so any insight into the problem would be
greatly appreciated.

Thanks!


William
