We see exactly the same error as reported by Kirby Bakken in November 2006
(the client loses contact with the fileserver under heavy load). On fast
multicore Opterons we can reproduce it in 100% of cases. The error occurs
under heavy load with an application that calculates the checksums of
3000+ files. Some details:

 - It happens with OpenAFS 1.4.1, 1.4.2 and 1.4.4
 - It happens on different Red Hat kernels: 2.6.9-42.0.10.ELsmp and
   2.6.9-55.ELsmp
 - It is reproducible on several identical machines
 - It happens with the most abundant afsd parameters, with the cache on disk
   or on a ramdisk
 - It does not happen with small memcaches (65536, 131072)
 - It reappears with a memcache of 256 MB

 - On the fileserver side, with verbose debugging in FileLog, everything is
   clean
 - On the client side, we have captured a pair of fstrace outputs; they can
   be seen at http://afs.caspur.it/rtb (these are large files of 100+ MB, in
   zipped form; Rainer Toebbicke pointed us to the place where the error
   occurred, and he was going to comment on it in a separate mail).

If anyone from the group wishes to debug it, we can provide him/her with
access to one of the machines in question, show how the error can be
reproduced, and give any needed support during the debugging.
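
The checksum application itself is not part of this mail. As a rough
stand-in for the kind of load described above, a minimal sketch might look
like the following. The function names, the use of SHA-256, the thread
count and the chunk size are all assumptions made for illustration; the
real application may differ in every one of these respects.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks.

    SHA-256 is an assumption; the original application's algorithm is unknown.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root, workers=8):
    """Checksum every file under root concurrently (hypothetical load generator)."""
    paths = [os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(root)
             for name in names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so paths and digests line up.
        return dict(zip(paths, pool.map(checksum, paths)))
```

Calling checksum_tree() on a directory in AFS that holds a few thousand
files, with workers set close to the number of cores, should generate a
comparable read load; whether it triggers the same error is untested.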

 Andrei.

On 11/14/06, Kirby Bakken <[EMAIL PROTECTED]> wrote:


More information....  Here's the 'format' of the write error messages:

afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com(all multi-homed ip addresses down for the server)
afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com(all multi-homed ip addresses down for the server)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)
.......
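
On Linux, error 110 is ETIMEDOUT ("Connection timed out"). To see how many
store failures fall between a "Lost contact" and a "back up" event, messages
in the format above can be tallied with a small script. This is only a
sketch: the event names and the summarize() helper are made up for
illustration, and the patterns are derived solely from the sample lines
shown here.

```python
import re
from collections import Counter

# Patterns taken from the sample messages above; other client versions
# may word these messages differently.
EVENTS = {
    "lost_contact": re.compile(r"afs: Lost contact with file server (\S+)"),
    "store_failed": re.compile(r"afs: failed to store file \((\d+)\)"),
    "back_up": re.compile(r"afs: file server (\S+) in cell \S+ is back up"),
}


def summarize(log_text):
    """Count occurrences of each known afs message type in log_text."""
    counts = Counter()
    for name, pattern in EVENTS.items():
        counts[name] = len(pattern.findall(log_text))
    return counts
```

Feeding it the output of dmesg or the relevant part of /var/log/messages
gives a quick per-event count to compare across runs.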
