We've been having unusual slowness and hangs at times on some of our fileservers, and I think I have a handle on the sequence of events, if not the cause. I could use some assistance in filling in the gaps so I can see if we can fix things.
Right now, I have a volume that is heavily used by many clients and is released frequently (as often as every ten minutes). This volume has three read-only replicas and is about 200 MB in size.
What I'm observing is that as soon as the vos release begins, one or more of the read-only replicas start accumulating connections in the 'error' state. FileLog shows that incoming FetchStatus RPCs to that replica are not being answered. If this condition persists long enough, these stuck connections eventually fill up the thread pool and the fileserver stops serving data to everything else.
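To put numbers on this, the error-state connections could be tallied from saved rxdebug output at intervals during a release. Below is a minimal Python sketch of that idea. It assumes only that each affected connection shows up as a line containing the word "error"; the function name `count_matching_lines` and the sample text are made up for illustration and are not real rxdebug output.

```python
def count_matching_lines(text, needle="error"):
    """Count the lines of `text` that contain `needle`, ignoring case."""
    needle = needle.lower()
    return sum(1 for line in text.splitlines() if needle in line.lower())


# Hypothetical sample, NOT real rxdebug output; it only mimics the idea of
# one line per call, with some of the calls sitting in an error state.
sample = """\
call 0: state active, mode: error
call 1: state active, mode: receiving
call 2: state active, mode: ERROR
"""

print(count_matching_lines(sample))
```

In practice the text would come from whatever rxdebug invocation is already in use (for example, read with `sys.stdin.read()`), and the count could be logged alongside a timestamp to see how quickly the thread pool fills after the release starts.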
At some point, up to five minutes later, as the release proceeds, the replica in question gets marked offline by the release process. At this time, all of the stuck RPCs get 'FetchStatus returns 106' (VOFFLINE), at which point the connection pool clears, and life on the fileserver returns to normal.
What I can't figure out is what's going on during the time the RPCs are hung, and why the connections show 'error'. (How does one determine what the error condition is, when viewing rxdebug output?)
Why would an RO replica be hung during a vos release? Any clues on where to look next would be appreciated.

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park
Office of Information Technology
