>We are experiencing some "timeout" problems. We are using some Cadence tools
>and from time to time a client loses contact with a server. The result is
>that the library the Cadence tool is using gets corrupted. If I do
>an fs listq the system reports lost contact, and after a while the system
>is up again (fs listq reports what it should). Do you have any tools, or
>any way to try to tune our system (AFS? network?)? We tried to use
>rxdebug and cmdebug, but it turned out to be painful and did not give us
>full answers.
>We are using AFS 3.2b on RS/6000 machines running AIX 3.2.2 and AIX 3.2.4.
This might be the famous CopyOnWrite() problem. I'm not sure at which release
they fixed the problem, but the problem can be described like this:
Suppose you have a backup volume and a file (in this case the Cadence
library) is changed. Since the backup volume and the regular volume both
point at the same file, the fileserver makes a copy of the file before
allowing it to be modified. If the file is huge, say 10 MB, then it must copy
all 10 MB before proceeding. This can take some time, and the routine that was
responsible for this had a bug in it (at least it used to): it would spin
around and around copying the file without relinquishing the thread to other
waiting threads. The result was that the fileserver "went away" until it
finished the copy, then came back. By that time the connection had timed out
and the fileserver would simply terminate the RX call. This usually left a
zero-length file.
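To make the starvation effect concrete, here is a toy model in Python. It is not the fileserver code, just a sketch under assumptions of mine: a 64 KB copy chunk (so a 10 MB file is 160 chunks), one "tick" of time per chunk, and a made-up client deadline of 50 ticks. The names copy_task, longest_stall and DEADLINE_TICKS are hypothetical. It compares a copy loop that never yields with one that yields every few chunks.

```python
DEADLINE_TICKS = 50   # hypothetical: how long a waiting client tolerates silence


def copy_task(total_chunks, yield_every):
    """Cooperative 'thread' that copies a file one chunk (one tick) at a time.

    Each value it yields is the number of ticks it held the CPU since its
    previous yield. yield_every=None models the buggy loop that only gives
    up the CPU once the whole copy is finished.
    """
    work = 0
    for done in range(1, total_chunks + 1):
        work += 1                                  # copy one chunk
        if yield_every and done % yield_every == 0:
            yield work                             # let other threads run
            work = 0
    if work:
        yield work                                 # copy finished


def longest_stall(total_chunks, yield_every):
    """Longest stretch of ticks during which no other caller was serviced."""
    clock = 0
    served = [0]                  # times at which a waiting caller got the CPU
    for ticks in copy_task(total_chunks, yield_every):
        clock += ticks            # copier ran this long without yielding
        served.append(clock)      # scheduler can now run a waiting thread
        clock += 1                # servicing that thread costs one tick
    return max(later - earlier for earlier, later in zip(served, served[1:]))


chunks_10mb = (10 * 1024 * 1024) // (64 * 1024)    # 160 chunks, assumed size

buggy = longest_stall(chunks_10mb, yield_every=None)
fixed = longest_stall(chunks_10mb, yield_every=8)

print("no yield:      stall of", buggy, "ticks, timeout:", buggy > DEADLINE_TICKS)
print("yield every 8: stall of", fixed, "ticks, timeout:", fixed > DEADLINE_TICKS)
```

In the non-yielding case the stall grows with the file size, which would match the symptom of large files triggering the problem. In the yielding case the stall stays bounded no matter how big the file is.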
Transarc fixed the problem, but I'm not sure which version they put the
change in. You should contact them to see if AFS 3.2b has this problem. If this
is the problem, you will see all of your clients lose contact with this
fileserver at the same moment.
Mark Giuffrida
Univ of Michigan
[EMAIL PROTECTED]