Dear OpenAFS 1.4.x Users,
At CERN we just suffered from a confusing problem where the fileserver process
would regularly segfault (on only one new server just put into production).
Since a gdb of the fileserver core file was showing random bit flips here and
there, we initially suspected a bad memory chip. However, the memory tested OK.
Finally we realised this was due to the use of select()/FD_SET in 1.4's
fssync.c: FD_SET writes outside the fd_set once a process holds file descriptors
numbered 1024 (FD_SETSIZE) or higher. Until quite recently this hadn't been a
problem, since
RHEL kernels used ulimit -Hn 1024 by default. However, as of kernel 2.6.32-279
the limit was raised to 4096 (to purge certain distros of dangerous
applications ;) ). This means that all 1.4.x servers running with 2.6.32-279
and later will get corrupted stacks in fssync.c and probably crash.
Note that 1.6 and later are safe from this RHEL kernel change since Simon
already patched fssync to use poll() 5 years ago ;)
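For comparison, here is a rough sketch of why poll() avoids the problem. It
takes an array of struct pollfd rather than a bitmap, so the numeric value of
a descriptor is not bounded by FD_SETSIZE. This is not the 1.6 fssync code;
read_when_ready and its return conventions are assumptions made up for
illustration.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper (not from OpenAFS): wait up to timeout_ms for fd to
 * become readable, then read up to len bytes into buf.
 * Returns the number of bytes read (0 means end-of-file),
 * -1 on error, or -2 if the timeout expired first. */
ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(buf != NULL);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* One array entry per descriptor: no fixed-size bitmap involved. */
    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return -2;
    if (pfd.revents & (POLLERR | POLLNVAL))
        return -1;

    return read(fd, buf, len);
}
```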
All of the nasty details of this incident are here:
https://afs.web.cern.ch/afs/reports/html/afs200SegFaults.html
We're now running with a workaround,
ulimit -Hn 1024; ulimit -Sn 1024
in our init scripts until we manage to upgrade to 1.6.
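For anyone who would rather check or enforce this from C than from an init
script, the sketch below uses getrlimit()/setrlimit() on RLIMIT_NOFILE. Both
function names are hypothetical and this is not what we run in production;
it only mirrors the effect of the two ulimit calls above.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper: returns 1 if the current soft limit lets this
 * process open descriptors numbered >= FD_SETSIZE, 0 if not,
 * -1 if the limit cannot be read. */
int nofile_limit_exceeds_fd_setsize(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return rl.rlim_cur > (rlim_t)FD_SETSIZE ? 1 : 0;
}

/* Hypothetical helper: lower the hard and soft RLIMIT_NOFILE to at most
 * cap. The hard limit is lowered too, so the process cannot raise its
 * soft limit again later. Returns 0 on success, -1 on error. */
int cap_nofile_limit(rlim_t cap)
{
    struct rlimit rl;

    assert(cap > 0);

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_max == RLIM_INFINITY || rl.rlim_max > cap)
        rl.rlim_max = cap;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > cap)
        rl.rlim_cur = cap;
    return setrlimit(RLIMIT_NOFILE, &rl);
}
```

Calling cap_nofile_limit(FD_SETSIZE) early in a process should leave every
descriptor it can open within range of an fd_set.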
Hope this saves someone the effort of troubleshooting this again.
Cheers,
Dan van der Ster
CERN IT-DSS