Hi Willy,

On Fri, 29 Mar 2024 07:17:56 +0100
Willy Tarreau <w...@1wt.eu> wrote:

> > These "connection refused" is from our watchdog; but the effects are as
> > perceptible from the outside.  When our watchdog hits this situation,
> > it will forcefully restart HAProxy (we have 2 instances) because there
> > will be a considerable service degradation.  If you remember, there's
> > https://github.com/haproxy/haproxy/issues/1895 and we talked briefly
> > about this in person, at HAProxyConf.  
> 
> Thanks for the context. I remembered we discussed for a while, but
> obviously I didn't remember the issue in question, given the number
> of issues I'm dealing with :-/
> 
> In the issue above I'm seeing an element from Felipe saying that a
> telnet to port 80 can take around 3 seconds to be accepted. That
> really makes me think about either the SYN queue being full, causing
> drops and retransmits, or a lack of socket memory to accept packets.
> The latter could possibly be caused by tcp_mem not being large enough
> due to some transfers with high-latency fast clients taking a lot of
> RAM, but it should not affect the local UNIX socket. Also, killing
> the process means killing all the associated connections, which will
> definitely free a huge amount of network buffers, so that could fuel
> this direction. If you have two instances, did you notice whether the
> two start to behave badly at the same time? If that's the case, it
> would definitely indicate a possible resource-based cause like socket
> memory etc.

Of our 2 HAProxy instances, it is usually just one (most often the
frontend one) that exhibits this behavior.  And as it is imperative
that the corrective action be as swift as possible, all instances are
terminated (which can include older instances left over from graceful
reloads) and new instances are started.  Very harsh, but at >50 Gbps,
each full second of downtime adds considerably to network pressure.

So for context, our least capable machine has 256 GB of RAM.  We have
not seen any spikes in the metrics we monitor, and this issue tends to
happen at a very stable steady state, albeit a loaded one.  While those
events are now outside our retention window for detailed data, we
didn't notice anything unusual, especially regarding memory usage, in
the traps we reported.

But of course, there could be a metric that we're not yet aware of
that correlates.  Any candidates from the dustier, darkest corners
that you know of?  :-)
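
In the meantime, to test the SYN-queue / socket-memory theory, we are
thinking of scripting something roughly like the sketch below.  Rough
and untested; the counters are standard Linux TcpExt / sockstat fields
and the 10-second interval is arbitrary:

  #!/usr/bin/env python3
  # Rough, untested sketch: snapshot a few kernel counters that should
  # move if SYN-queue overflows or TCP memory pressure are involved.
  import time

  def tcp_ext():
      # /proc/net/netstat comes in header/value line pairs per protocol
      with open("/proc/net/netstat") as f:
          lines = f.read().splitlines()
      for hdr, val in zip(lines[::2], lines[1::2]):
          if hdr.startswith("TcpExt:"):
              return dict(zip(hdr.split()[1:], map(int, val.split()[1:])))
      return {}

  while True:
      ext = tcp_ext()
      with open("/proc/net/sockstat") as f:
          sockstat = f.read()
      print(time.strftime("%F %T"),
            "ListenOverflows=%s ListenDrops=%s TCPMemoryPressures=%s"
            % (ext.get("ListenOverflows"), ext.get("ListenDrops"),
               ext.get("TCPMemoryPressures")))
      print(sockstat, end="")
      time.sleep(10)           # arbitrary interval

If ListenOverflows/ListenDrops or the TCP "mem" figure in sockstat
jump right before the watchdog fires, that would point strongly in the
direction you describe.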


> 
> > But this is incredibly elusive to reproduce; it comes and goes.  It
> > might happen every few minutes, or not happen at all for months.
> > It is not tied to a specific setup: different versions, kernels,
> > machines.  In fact, we do not have a better way to detect the
> > situation, at least not one as fast, reactive, and resilient.
> 
> It might be useful to take periodic snapshots of /proc/slabinfo and
> see if something jumps during such incidents (grep for TCP, net and
> skbuff there). I guess you have not noticed any "out of socket
> memory" or similar indications in your kernel logs, of course :-/

We have no indication of network-related memory pressure.  At peak, we
usually see something like 15~22% overall active memory (I don't
recall exactly, but it might take >70% active memory for these
machines to actually degrade, maybe more).  As for TCP, around 16~30k
active sockets, plus some 50~100k in TIME_WAIT, and still not causing
any problems.
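
We will add the slabinfo snapshots, though.  Something along these
lines is what we have in mind (a rough sketch; reading /proc/slabinfo
usually needs root, and the filters and interval are just a starting
point):

  #!/usr/bin/env python3
  # Rough sketch: periodically dump the TCP/skbuff-related slab caches
  # so they can be diffed around an incident.
  import time

  INTERESTING = ("TCP", "tw_sock_TCP", "request_sock_TCP", "skbuff")

  while True:
      stamp = time.strftime("%F %T")
      with open("/proc/slabinfo") as f:
          for line in f.readlines()[2:]:      # skip the two header lines
              name, active, total = line.split()[:3]
              if any(key in name for key in INTERESTING):
                  print(stamp, name, "active=%s total=%s" % (active, total))
      time.sleep(10)           # arbitrary interval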


> 
> Another one that could make sense to monitor is "PoolFailed" in
> "show info". It should always remain zero.

We collect this one (all of the available metrics, actually); I don't
remember it ever measuring more than zero.  But we'll keep an eye on
it.
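
For completeness, something like the sketch below would cover that
polling, assuming a stats socket at /var/run/haproxy.sock (just an
example path, not necessarily ours):

  #!/usr/bin/env python3
  # Rough sketch: poll PoolFailed from "show info" on the stats socket.
  # /var/run/haproxy.sock is only an example path.
  import socket, time

  SOCK = "/var/run/haproxy.sock"

  def show_info():
      s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      s.connect(SOCK)
      s.sendall(b"show info\n")
      data = b""
      while True:
          chunk = s.recv(4096)
          if not chunk:
              break
          data += chunk
      s.close()
      return dict(line.split(": ", 1)
                  for line in data.decode().splitlines() if ": " in line)

  while True:
      info = show_info()
      print(time.strftime("%F %T"),
            "PoolFailed=%s" % info.get("PoolFailed"))
      time.sleep(10)           # arbitrary interval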

Incidentally, could this be somewhat unrelated to HAProxy, i.e., maybe
in the kernel?

Cheers,

-- 
Ricardo Nabinger Sanchez             https://www.taghos.com.br/
