Hi,

Today we noticed a strange, reproducible cache corruption bug on a Linux host with a SUSE Linux kernel and OpenAFS 1.3.84.

If I generate a high load on an SMP client (30-40 files written simultaneously into a single volume), then after some minutes at a CPU load of 10-12 on a 2-CPU system (4 with Hyperthreading) I lose contact with the AFS server that holds the volume I am writing to. In a tcpdump I see that, shortly before contact is lost, the client issues a fetch-status call several times and the server sends an ABORT half a second later.

I am still able to browse volumes that are located on a different server, but even after fs checkvol, fs checkserver, or whatever else, I am not able to chdir into any volume that resides on that server. Even an afs-client restart doesn't fix the problem. The only solution is to stop the client; rm -Rf /afs_cache; afs-client start.

I tried to reproduce the same error on a kernel 2.4 system, without any luck. Even after a 2-hour stress test (CPU load of 10-20) I only get some short hangs of 2-5 seconds. During those hangs I also see in tcpdump the fetch-status request from the client and the ABORT response from the server, but the client recovers (in dmesg I see "lost connection to server" and a few seconds later "fileserver xyz server is back up").
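
For anyone who wants to try reproducing this, the write load described above could be approximated with something like the following Python sketch. It is not the original test setup: the file count default, file size, file names, and the use of threads instead of separate processes are all assumptions made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_file(path, size_bytes, block_size=1 << 20):
    """Write size_bytes of zero bytes to path in blocks; return bytes written."""
    block = b"\0" * block_size
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(block_size, size_bytes - written)
            f.write(block[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache
    return written


def generate_load(target_dir, num_files=35, size_bytes=64 << 20):
    """Write num_files files into target_dir at the same time.

    num_files=35 sits in the 30-40 range from the report; size_bytes is an
    arbitrary assumption. One thread per file, so all writes overlap.
    Returns the total number of bytes written.
    """
    paths = [
        os.path.join(target_dir, "stress_%03d.dat" % i)  # hypothetical names
        for i in range(num_files)
    ]
    with ThreadPoolExecutor(max_workers=num_files) as pool:
        results = list(pool.map(lambda p: write_file(p, size_bytes), paths))
    return sum(results)
```

To use it, call generate_load() with a directory inside the affected volume (the path is whatever applies in your cell) and repeat the call in a loop for several minutes while watching dmesg and tcpdump on the client.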

The workaround for us is now to use the 2.4 kernel, but I assume this should be fixed before 1.4 ...
If somebody is interested, I can provide a compressed tcpdump (5 MB from the 2.4 kernel test and 9 MB from the SLES9 kernel test).
We will try to debug the problem in more depth tomorrow. Now I need some sleep, and any good ideas on where to look would be appreciated ... :-)

BTW, client and server are connected via gigabit ethernet and the server has fibre drives. During the test the client generates a load of 10-20 MB/sec. The server is started with -p 32 -L, if that makes any difference; the client is started with -stat 4000 -dcache 4000 -daemons 6 -volumes 256 -chunksize 17 -nosettime.
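
To put the client options in perspective: afsd takes -chunksize as an exponent of 2, so 17 means 128 KiB chunks. The short Python calculation below just spells that out. The second figure is only a rough estimate, and it assumes that each of the 4000 in-memory dcache entries maps one full chunk.

```python
def chunk_size_bytes(chunksize_exponent):
    """afsd -chunksize is a power-of-two exponent; return the chunk size in bytes."""
    return 1 << chunksize_exponent


chunk = chunk_size_bytes(17)   # -chunksize 17
dcache_entries = 4000          # -dcache 4000

print(chunk)                   # 131072 bytes
print(chunk // 1024)           # 128 KiB per chunk

# Rough estimate (assumption: one full chunk per dcache entry).
covered = dcache_entries * chunk
print(covered // (1 << 20))    # 500 MiB
```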

The client has 1 GB of RAM, the server has 4 GB.

Sven
