Alexander Patrakov wrote:
[...]
What would be a good solution (as in: something that does not convert
crashes into deadlocks) here? I understand that, after memory
corruption, we are already in the UB territory, but is there anything
better possible than what is implemented?

I would suggest a monitor daemon that runs GDB to get the backtrace. The simplest way to do this would be for Ceph to have its own supervisor and to give each daemon a pipe back to that supervisor. (This is not a unique design; PostgreSQL has long had a "postmaster" process that manages the worker "postgres" backend processes.) The fatal error handler then need only write(2) a static string and/or a fixed buffer (to report the signal number) to the pipe and enter an infinite loop. The supervisor then kills the crashed process, possibly after attaching GDB and collecting a backtrace.

Alternatively, simply run the Ceph daemons with a nonzero `ulimit -c` and collect the core files. The core files can be analyzed using GDB after the fact. No dedicated supervisor is needed here, only kernel facilities.
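If setting the limit from the service manager or a wrapper script is inconvenient, a process can also raise its own soft limit at startup. The following is only a sketch of that variant, using the standard getrlimit(2)/setrlimit(2) calls; the function name enable_core_dumps is made up, and whether the daemons should do this themselves is an assumption on my part, not something Ceph does as far as I know.

```c
#include <sys/resource.h>

/* Raise the soft core-size limit to the hard limit, which is the
 * programmatic counterpart of `ulimit -c` in the shell.
 * Returns 0 on success, -1 on failure. */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* cannot exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Note that this cannot help if the hard limit itself is zero; in that case the limit has to be raised by whatever starts the daemon.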

The central problem here, as I understand it, is trying to do too much in a process that has gone into undefined behavior. Attaching GDB or dumping a core file both sidestep that problem.


-- Jacob
