We see exactly the same error as reported by Kirby Bakken in November 2006 (the client loses contact with the fileserver under a heavy load). On fast multicore Opterons we can reproduce it in 100% of cases. The error occurs under a heavy load with an application that calculates the checksums of 3000+ files. Some details:
- It happens with OpenAFS 1.4.1, 1.4.2 and 1.4.4
- It happens on different Red Hat kernels: 2.6.9-42.0.10.ELsmp, 2.6.9-55.ELsmp
- It is reproducible on several identical machines
- It happens with the most abundant afsd parameters, with the cache on disk or on a ramdisk
- It does not happen with small memcaches (65536, 131072)
- It reappears with a memcache of 256MB
- On the fileserver side, with verbose debug in FileLog, everything is clean
- On the client side, we have captured a pair of fstrace outputs; they may be seen under http://afs.caspur.it/rtb (these are large files of 100+MB, in zipped form; Rainer Toebbicke pointed us to the place where the error occurred, and he was going to comment on it in a separate mail).

If somebody from the group wishes to debug it, we could provide him/her with access to one of the machines in question, show how the error may be reproduced and give any needed support during the debug operations.

Andrei.

On 11/14/06, Kirby Bakken <[EMAIL PROTECTED]> wrote:
More information....

Here's the 'format' of the write error messages:

afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com(all multi-homed ip addresses down for the server)
afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com(all multi-homed ip addresses down for the server)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)
afs: file server 9.41.253.103 in cell austin.ibm.com is back up (multi-homed address; other same-host interfaces may still be down)
.......
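
For anyone comparing logs from several clients, the message formats quoted above can be tallied with a short script. This is only a rough sketch: the regular expressions are derived solely from the lines shown in this thread, the function name `summarize` is made up for illustration, and other OpenAFS versions may word these messages differently. On Linux, the code 110 in "failed to store file (110)" corresponds to ETIMEDOUT.

```python
import re
from collections import Counter

# Patterns derived only from the messages quoted in this thread (an assumption;
# other client versions may format them differently).
LOST = re.compile(r"afs: Lost contact with file server (\S+) in cell")
BACK = re.compile(r"afs: file server (\S+) in cell \S+ is back up")
STORE_FAIL = re.compile(r"afs: failed to store file \((\d+)\)")


def summarize(lines):
    """Count lost-contact, back-up and store-failure messages in log lines."""
    counts = Counter()
    for line in lines:
        m = LOST.search(line)
        if m:
            counts[("lost", m.group(1))] += 1
            continue
        m = BACK.search(line)
        if m:
            counts[("back", m.group(1))] += 1
            continue
        m = STORE_FAIL.search(line)
        if m:
            # Key on the numeric error code reported by the client.
            counts[("store_failed", int(m.group(1)))] += 1
    return counts


# Sample input copied from the quoted messages.
sample = [
    "afs: Lost contact with file server 9.41.253.103 in cell austin.ibm.com"
    "(all multi-homed ip addresses down for the server)",
    "afs: failed to store file (110)",
    "afs: failed to store file (110)",
    "afs: file server 9.41.253.103 in cell austin.ibm.com is back up "
    "(multi-homed address; other same-host interfaces may still be down)",
]
counts = summarize(sample)
for key, n in sorted(counts.items(), key=str):
    print(key, n)
```

Feeding it the kernel log of an affected client (for example, the lines of a saved dmesg output) should show how many store failures fall between each lost-contact and back-up pair.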
