If you have tcpdump data for cache manager <-> vlserver and cache
manager <-> fileserver traffic during one of these corruptions, that
could be very helpful.  I've found tcpdump (or wireshark/tshark) to be
useful in tracking down issues like this because you can very quickly
see if the problem is

1- cache manager asking for the wrong thing to start with (possibly
cache corruption; not conclusive on its own, because you have to
determine whether the cache manager got the bad data and cached it, or
whether the cache manager 'broke' the data itself.  Picking one client,
clearing its cache, and then retrying can help answer that question).
Note that in your case this is pretty unlikely, given that you saw it
across multiple clients on multiple OSes.
2- vlserver giving a wrong answer
3- neither of the above, which means the fileserver is giving a wrong answer.
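
To get a capture that lets you tell those three cases apart, something
like the following can help.  This is only a sketch: it assumes the
standard AFS UDP ports (7000 fileserver, 7001 cache manager callbacks,
7003 vlserver), and the interface name, client address, and output
filename are placeholders I made up.  It just builds the tcpdump
command line for a single client so you can review it before running it.

```python
import shlex

# Standard AFS UDP port assignments; adjust if your cell uses others.
AFS_PORTS = {"fileserver": 7000, "cache_manager": 7001, "vlserver": 7003}


def build_capture_command(interface, client_ip, outfile):
    """Return a tcpdump argv list that captures one client's AFS traffic."""
    ports = " or ".join(f"port {p}" for p in sorted(AFS_PORTS.values()))
    # Restrict to one client so the vlserver and fileserver exchanges
    # for that client can be read side by side.
    bpf = f"udp and host {client_ip} and ({ports})"
    # -n: no name resolution, -s 0: full packets, -w: write raw pcap
    return ["tcpdump", "-n", "-i", interface, "-s", "0", "-w", outfile, bpf]


if __name__ == "__main__":
    # Placeholder values; substitute your own interface and client address.
    cmd = build_capture_command("eth0", "192.0.2.10", "afs-client.pcap")
    print(shlex.join(cmd))
```

Writing the raw packets with -w means you can open the same file in
wireshark/tshark afterwards.  One caveat: a port-based filter will not
match IP fragments after the first one, so drop the port clause if you
suspect fragmentation.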

The usual suspects (e.g., cmdebug) are also helpful here.  In case 3
above, it might also be useful to get the callback state from the
fileservers to see what they think the cache managers hold.  Given that
'failed volume moves' seem to have been a trigger for this, the
logfiles might have something interesting, especially if you can
provide volume names and volume IDs for the X-volumes.
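
If it helps with combing through the logfiles, here is a rough sketch of
pulling out every line that mentions one of the affected volumes by name
or ID.  The function name and the sample lines are made up for
illustration (they are not real log output), and which logs you feed it
is up to you.

```python
import re


def find_volume_mentions(log_lines, volume_names, volume_ids):
    """Return (line_number, line) pairs that mention any given volume."""
    needles = [str(n) for n in list(volume_names) + list(volume_ids)]
    if not needles:
        return []
    # The lookarounds keep an ID such as 536870915 from matching inside
    # a longer number such as 5368709150.
    alternatives = "|".join(re.escape(n) for n in needles)
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return [
        (number, line)
        for number, line in enumerate(log_lines, start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    # Invented sample lines, only to show the matching behaviour.
    sample = [
        "vol 536870915 moved",
        "nothing here",
        "attach user.jdoe failed",
        "id 5368709150 other",
    ]
    for number, line in find_volume_mentions(sample, ["user.jdoe"], [536870915]):
        print(number, line)
```

Matching on both the name and the numeric ID is deliberate, since a
given log line may carry only one of the two.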

-- 
Steven Jenkins
End Point Corporation
http://www.endpoint.com/
