Alexander Patrakov wrote:
[...] What would be a good solution (as in: something that does not convert crashes into deadlocks) here? I understand that, after memory corruption, we are already in the UB territory, but is there anything better possible than what is implemented?
I would suggest a monitor daemon that runs GDB to get the backtrace. The simplest way to do this would be for Ceph to have its own supervisor and to give each daemon a pipe back to it. (This design is not unusual: PostgreSQL has long had a "postmaster" process that manages the "postgres" backend worker processes.) The fatal error handler then need only write(2) to the pipe from a static string and/or a fixed buffer (to report the signal number) and enter an infinite loop. The supervisor kills the crashed process, possibly after attaching GDB and collecting a backtrace.
Alternatively, simply run the Ceph daemons with `ulimit -c` nonzero and collect the core files. The core files can be analyzed using GDB after the fact. No dedicated supervisor is needed here, only kernel facilities.
The central problem here, as I understand it, is trying to do too much in a process that has gone into undefined behavior. Attaching GDB and dumping a core file both sidestep that problem, because the analysis happens outside the damaged process.
-- Jacob