Re: [PATCH] BUG/MEDIUM: mworker: clear TH_FL_STUCK in the restart_wait loop

Willy Tarreau Fri, 13 Mar 2026 22:41:18 -0700

On Thu, Mar 12, 2026 at 02:55:57PM +0100, William Lallemand wrote:
> Hi Alexander,
> 
> Sorry for the late reply, I'm trying to flush my backlog these days.
> 
> On Wed, Feb 25, 2026 at 05:47:32PM +0000, Stephan, Alexander wrote:
> > We hit a semi-reproducible crash (depending on the hardware, memory
> > allocations etc.) where the HAProxy master process is killed by its own
> > watchdog timer while inside mworker_catch_sigchld().  The crash happens when
> > many worker processes exit simultaneously. It seems to be more common on
> > CPUs with a lower clock frequency, where the loop eats up more CPU time. In a
> > worst-case scenario the CPU usage can be quite high.
> 
> How many leaving workers are we talking about, out of how many total workers?
> That seems a bit strange to us.


Agreed, TH_FL_STUCK is reset when entering the scheduler just after
processing signals. I'm having a hard time imagining that we can
spend more than one second of CPU handling signals. It seriously
sounds like a bug somewhere, but I think that clearing TH_FL_STUCK
just masks the root cause of the problem. In principle I wouldn't be
shocked by clearing the flag between signal handlers, of course,
since its purpose is to detect lack of progress, but I'd first want
to understand where this second of CPU is spent so that we can fix
this, because in any case it's just not acceptable to spend one
full second doing nothing but handling signals.
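
To make the mechanism discussed above concrete, here is a minimal sketch
of a "stuck flag" watchdog. It is a simplified model and not HAProxy's
actual code: only the name TH_FL_STUCK comes from this thread, while the
flag value, struct thread_model and all model_*() functions are
hypothetical and made up for illustration. It assumes the watchdog ticks
once per fixed slice of CPU time and fires when it finds the flag still
set from its previous tick.

```c
#include <assert.h>

/* Name taken from the thread; the value is an assumption for this model. */
#define TH_FL_STUCK 0x00000001u

/* Hypothetical per-thread state, reduced to the flags word. */
struct thread_model {
    unsigned int flags;
};

/* Entering the scheduler means progress was made: drop the flag. */
static void model_enter_scheduler(struct thread_model *th)
{
    th->flags &= ~TH_FL_STUCK;
}

/* What the proposed patch would do between two signal handlers. */
static void model_clear_stuck(struct thread_model *th)
{
    th->flags &= ~TH_FL_STUCK;
}

/* One watchdog tick. Returns 1 if the flag set by the previous tick is
 * still there (no progress seen), otherwise sets it and returns 0.
 */
static int model_wdt_tick(struct thread_model *th)
{
    if (th->flags & TH_FL_STUCK)
        return 1;
    th->flags |= TH_FL_STUCK;
    return 0;
}

/* Simulates <ticks> watchdog ticks elapsing while signal handlers run
 * without ever returning to the scheduler. If <clear_between_handlers>
 * is non-zero, the flag is cleared after each tick, as if a handler
 * boundary had been crossed. Returns the number of the tick at which
 * the watchdog fires, or 0 if it never fires.
 */
static int model_run(int ticks, int clear_between_handlers)
{
    struct thread_model th = { 0 };
    int t;

    assert(ticks >= 0);
    model_enter_scheduler(&th);
    for (t = 1; t <= ticks; t++) {
        if (model_wdt_tick(&th))
            return t;
        if (clear_between_handlers)
            model_clear_stuck(&th);
    }
    return 0;
}
```

In this model, model_run(2, 0) fires at the second tick, while
model_run(5, 1) never fires. That is why clearing the flag makes the
crash go away without explaining where the CPU time is spent.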

Willy
