Dear OpenAFS 1.4.x Users,

At CERN we just suffered from a confusing problem where the fileserver process 
would regularly segfault (on only one new server, which had just been put into 
production). Since gdb on the fileserver core file was showing random bit flips 
here and there, we initially suspected a bad memory chip. However, the memory 
tested OK.

Finally we realised this was due to the use of select()/FD_SET in 1.4's 
fssync.c, and the way FD_SET writes past the end of the fd_set when a process 
hands it a file descriptor numbered 1024 (FD_SETSIZE) or higher. Until quite 
recently this hadn't been a problem, since RHEL kernels used ulimit -Hn 1024 by 
default. However, as of kernel 2.6.32-279 the limit was raised to 4096 (to 
purge certain distros of dangerous applications ;) ). This means that all 1.4.x 
servers running with 2.6.32-279 and later will get corrupted stacks in fssync.c 
and probably crash.
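
To illustrate the failure mode, here is a minimal sketch of the kind of range 
check that avoids the overflow. It is not the actual fssync.c code; the helper 
names fd_fits_in_fd_set() and safe_fd_set() are made up for this example.

```c
#include <stdbool.h>
#include <sys/select.h>

/* Hypothetical helper (not from OpenAFS): an fd_set only has room for
 * descriptors 0 .. FD_SETSIZE-1, so anything outside that range must
 * never be passed to FD_SET/FD_CLR/FD_ISSET. */
static bool fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical wrapper: add fd to the set only if it is in range.
 * Returns 0 on success, -1 if the descriptor would overflow the set. */
static int safe_fd_set(int fd, fd_set *set)
{
    if (!fd_fits_in_fd_set(fd))
        return -1;          /* FD_SET here would write out of bounds */
    FD_SET(fd, set);
    return 0;
}
```

Without such a check, FD_SET on an out-of-range descriptor silently flips a 
bit in whatever memory happens to follow the fd_set, which would be consistent 
with the "random bit flips" we saw in the core files.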

Note that 1.6 and beyond are safe from this RHEL kernel change, since Simon 
already patched fssync to use poll() 5 years ago ;)
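
For comparison, a minimal sketch of waiting on a single descriptor with poll(), 
which takes an array of struct pollfd and so has no FD_SETSIZE ceiling. This is 
only an illustration, not the 1.6 fssync code; wait_readable() is a 
hypothetical name.

```c
#include <poll.h>

/* Hypothetical helper (not from OpenAFS): wait up to timeout_ms for fd
 * to become readable. Returns 1 if readable (or at EOF/hangup),
 * 0 on timeout, -1 on error or invalid descriptor. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;            /* any descriptor value is fine, even >= 1024 */
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;          /* poll() itself failed, see errno */
    if (rc == 0)
        return 0;           /* timed out, nothing to read */
    if (pfd.revents & (POLLIN | POLLHUP))
        return 1;           /* data available, or peer closed */
    return -1;              /* POLLERR or POLLNVAL */
}
```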

All of the nasty details of this incident are here:
    https://afs.web.cern.ch/afs/reports/html/afs200SegFaults.html

We're now running with a workaround,
  ulimit -Hn 1024; ulimit -Sn 1024
in our init scripts until we manage to upgrade to 1.6.
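
If you want to check from inside a process whether it is exposed, something 
along these lines should work. It is only a sketch under the assumption that 
the soft RLIMIT_NOFILE limit is what matters; the function names are made up 
for this example.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper: a soft limit above FD_SETSIZE means the process
 * can be handed descriptors that do not fit in an fd_set. */
static int soft_limit_is_risky(rlim_t soft)
{
    return soft > (rlim_t)FD_SETSIZE ? 1 : 0;
}

/* Hypothetical helper: returns 1 if the current soft RLIMIT_NOFILE
 * exceeds FD_SETSIZE, 0 if it does not, -1 if getrlimit() fails. */
static int current_limit_is_risky(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return soft_limit_is_risky(rl.rlim_cur);
}
```

With the ulimit workaround above in place, current_limit_is_risky() would be 
expected to return 0 on a system where FD_SETSIZE is 1024.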

Hope this saves someone the effort of troubleshooting this again.

Cheers, 
Dan van der Ster
CERN IT-DSS