Hi Willy,

On Fri, 29 Mar 2024 07:17:56 +0100 Willy Tarreau <w...@1wt.eu> wrote:
> > These "connection refused" are from our watchdog; but the effects
> > are as perceptible from the outside. When our watchdog hits this
> > situation, it will forcefully restart HAProxy (we have 2 instances)
> > because there will be a considerable service degradation. If you
> > remember, there's https://github.com/haproxy/haproxy/issues/1895 and
> > we talked briefly about this in person, at HAProxyConf.
>
> Thanks for the context. I didn't remember about the issue. I remembered
> we discussed for a while but didn't remember about the issue in question
> obviously, given the number of issues I'm dealing with :-/
>
> In the issue above I'm seeing an element from Felipe saying that telnet
> to port 80 can take between 3 seconds to accept. That really makes me
> think about either the SYN queue being full, causing drops and retransmits,
> or a lack of socket memory to accept packets. That one could possibly be
> caused by tcp_mem not being large enough due to some transfers with high
> latency fast clients taking a lot of RAM, but it should not affect the
> local UNIX socket. Also, killing the process means killing all the
> associated connections and will definitely result in freeing a huge
> amount of network buffers, so it could fuel that direction. If you have
> two instances, did you notice if the two start to behave badly at the
> same time ? If that's the case, it would definitely indicate a possible
> resource-based cause like socket memory etc.

Of our 2 HAProxy instances, it is usually one (mostly the frontend one)
that exhibits this behavior. And as it is imperative that the corrective
action be as swift as possible, all instances are terminated (which can
include older instances left over from graceful reloads) and new ones
are started. Very harsh, but at >50 Gbps, each full second of downtime
adds considerably to the network pressure.

For context, our least capable machine has 256 GB of RAM. We have not
seen any spikes in the metrics we monitor, and this issue tends to
happen at a very stable steady state, albeit a loaded one. The detailed
data for the traps we reported has already aged out of our retention
window, but we didn't notice anything unusual at the time, especially
regarding memory usage. Of course, there could be a metric we're not yet
aware of that correlates. Any candidates from the dustier, darker
corners that you know of? :-)

> > But this is incredibly elusive to reproduce; it comes and goes. It
> > might happen every few minutes, or not happen at all for months. Not
> > tied to a specific setup: different versions, kernels, machines. In
> > fact, we do not have better ways to detect the situation, at least not
> > as fast, reactive, and resilient.
>
> It might be useful to take periodic snapshots of /proc/slabinfo and
> see if something jumps during such incidents (grep for TCP, net, skbuff
> there). I guess you have not noticed any "out of socket memory" nor such
> indications in your kernel logs, of course :-/

We have no indications of network-related memory pressure. At peak we
usually see something like 15~22% of overall active memory in use (the
exact threshold escapes me, but it would probably take >70% of active
memory for these machines to actually degrade, maybe more). As for TCP,
around 16~30k active sockets, plus some 50~100k in TIME_WAIT, and still
not causing any problems.

> Another one that could make sense to monitor is "PoolFailed" in
> "show info". It should always remain zero.

We collect this one (all available metrics, actually); I don't remember
it ever measuring anything other than zero.
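For the periodic /proc/slabinfo snapshots, something along these lines
should be enough on our side (a minimal Python sketch rather than our
real collector; it also pulls PoolFailed from "show info" while at it,
and the stats socket path is a placeholder for whatever "stats socket"
is set to in haproxy.cfg):

#!/usr/bin/env python3
# Rough illustration, not our production watchdog: every INTERVAL seconds,
# dump the TCP/net/skbuff lines from /proc/slabinfo and the PoolFailed
# counter from HAProxy's "show info".  STATS_SOCKET is a placeholder;
# point it at whatever "stats socket" is set to in haproxy.cfg.
import re
import socket
import time

STATS_SOCKET = "/var/run/haproxy.sock"   # placeholder path
INTERVAL = 10                            # seconds between snapshots

def slab_snapshot():
    """Return the slabinfo lines that relate to networking."""
    with open("/proc/slabinfo") as f:     # root-only on most kernels
        return [line.rstrip() for line in f
                if re.match(r"(TCP|net|skbuff)", line)]

def pool_failed():
    """Send "show info" to the stats socket and extract PoolFailed."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(STATS_SOCKET)
        s.sendall(b"show info\n")
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    for line in data.decode().splitlines():
        if line.startswith("PoolFailed:"):
            return int(line.split(":", 1)[1])
    return None

while True:
    ts = time.strftime("%Y-%m-%d %H:%M:%S")
    print(ts, "PoolFailed:", pool_failed())
    for line in slab_snapshot():
        print(ts, line)
    time.sleep(INTERVAL)

Run as root (slabinfo is usually readable only by root), and diffing
consecutive snapshots around a watchdog hit should make it obvious
whether the TCP or skbuff slabs are ballooning.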
But we'll keep an eye on it.

By the way, could this be somewhat unrelated to HAProxy itself, i.e.,
perhaps the kernel?

Cheers,

-- 
Ricardo Nabinger Sanchez
https://www.taghos.com.br/