On Tue, May 14, 2024 at 10:58 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote:

> I will try to solve this issue by myself, but, if you have any interesting 
> idea, please, share it with me :)

It is great that you can reproduce the issue reliably - it gives hope
that we can find the problem.

I still think something is off on your production machine. So if I
were you, I would work towards being able to reproduce the issue on
another machine - preferably a VM.

Maybe install a fresh VM with the same OS. Take a snapshot (called A).
Then copy all files from production to the VM (most importantly /bin,
/lib, /usr, and /etc). If you can then reproduce the error, take another
snapshot (called B). Then copy files from A to B. Can you make the
error disappear? Can you make the error appear if you copy files from
B to A?
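
To decide which files to copy first, it may help to list what actually
differs between the two snapshots. A rough sketch, assuming both
snapshots are mounted or unpacked as ordinary directory trees; the
function name and the /mnt/A and /mnt/B paths are made up for
illustration:

```shell
# Hypothetical helper: list files that differ between two directory trees.
# $1 and $2 are assumed to be the roots (or subdirectories) of snapshots A and B.
list_differing_files() {
    # -r recurses into subdirectories.
    # -q only reports that files differ, not how they differ.
    # Exit status is 0 if the trees are identical, 1 if they differ.
    diff -rq "$1" "$2"
}

# Example call (paths are assumptions, adjust to where the snapshots live):
#   list_differing_files /mnt/A/etc /mnt/B/etc
```

Running it on one directory at a time (/etc first, then /lib, and so on)
keeps the output short enough to read.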

Is there some sort of monitoring system on production that is not on
your VM? Maybe such a system would find it weird that a lot of
processes are killed off in one go.
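
One way to spot such a system is to compare the names of the running
processes on both machines. A minimal sketch, assuming you save the
process list on each machine with `ps -eo comm= > procs.txt` and copy
both files to one place; the function name and file names are
hypothetical:

```shell
# Hypothetical helper: print the process names found in the first list
# but not in the second. Each input file holds one process name per line.
only_in_first() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    # comm needs sorted input; -u also drops duplicate names.
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    # -2 hides lines unique to the second file, -3 hides common lines.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example call (file names are assumptions):
#   only_in_first procs-production.txt procs-vm.txt
```

Anything this prints runs on production only and is a candidate for
closer inspection.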

Can you trigger the error by:

  seq 10000 | parallel -j 0 sleep &
  sleep 1
  killall -9 sleep

Happy bug hunting

/Ole
