On Thu, Nov 8, 2012 at 10:41 AM, Dan Van Der Ster <[email protected]> wrote:
> Dear OpenAFS 1.4.x Users,
>
> At CERN we just suffered from a confusing problem where the fileserver
> process would regularly segfault (on only one new server just put into
> production). Since a gdb of the fileserver core file was showing random bit
> flips here and there, we initially suspected a bad memory chip. However, the
> memory tested OK.
>
> Finally we realised this was due to fssync.c in 1.4's use of select()/FD_SET
> and the corrupting behaviour of those functions when using >1024 file
> descriptors per process. Until quite recently this hadn't been a problem,
> since RHEL kernels used ulimit -Hn 1024 by default. However, as of kernel
> 2.6.32-279 the limit was raised to 4096 (to purge certain distro's of
> dangerous applications ;) ). This means that all 1.4.x servers running with
> 2.6.32-279 and later will get corrupted stacks in fssync.c and probably crash.
>
> Note that 1.6 and beyond is safe from this RHEL kernel change since Simon
> already patched fssync to use poll() 5 years ago ;)
>
> All of the nasty details of this incident here:
> https://afs.web.cern.ch/afs/reports/html/afs200SegFaults.html
>
> We're now running with a workaround,
> ulimit -Hn 1024; ulimit -Sn 1024
> in our init scripts until we manage to upgrade to 1.6.
>
> Hope this saves someone the effort of troubleshooting this again.
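
For anyone wondering how select()/FD_SET can corrupt a stack: an fd_set is a
fixed-size bitmap of FD_SETSIZE bits (1024 with glibc on Linux), and FD_SET()
does no bounds check, so a descriptor numbered FD_SETSIZE or higher sets a bit
past the end of the structure. poll() takes an array of descriptors and has no
such limit. The following is only a minimal sketch of the difference, not code
from fssync.c; wait_readable_select() and wait_readable_poll() are hypothetical
helper names chosen for illustration.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper (not from fssync.c): select()-based wait that adds
 * the bounds check FD_SET() itself does not perform.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error. */
int wait_readable_select(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int rc;

    /* Without this guard, FD_SET(fd, &rfds) with fd >= FD_SETSIZE would
     * write outside rfds, i.e. into neighbouring stack memory. */
    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EINVAL;
        return -1;
    }

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* A negative timeout means "block indefinitely". */
    rc = select(fd + 1, &rfds, NULL, NULL, timeout_ms < 0 ? NULL : &tv);
    if (rc <= 0)
        return rc;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}

/* Hypothetical helper: the same wait using poll(), which works for any
 * descriptor number the process is allowed to open. */
int wait_readable_poll(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;
    if (pfd.revents & POLLNVAL) {   /* fd is not an open descriptor */
        errno = EBADF;
        return -1;
    }
    return 1;   /* POLLIN, POLLHUP or POLLERR: a read() will not block */
}
```

The guarded select() variant only turns silent corruption into an error; it
still cannot service descriptors at or above FD_SETSIZE, which is why capping
the limit at 1024 works as a stopgap and poll() is the real fix.
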

Unless you manually define HAVE_POLL, poll() may not actually be enabled in
1.6: we never added the configure test for it. That will be fixed in 1.6.2.

Also note that salvsync, unlike fssync, currently never tries poll().

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
