Hello Willy and list,
First, thanks a lot for your answer!
On 02.07.2013 00:35, Willy Tarreau wrote:
> On Sun, Jun 30, 2013 at 07:02:04PM +0200, Tomas Pospisek wrote:
>> shortly after starting haproxy on one of our servers, haproxy goes to
>> 100% and stays there.
>>
>> We have multiple instances of haproxy running on various servers, but
>> only one instance is consistently running at 100%.
>>
>> We are running
>>
>> * 3 instances with 1.4.24-1~bpo70+1 (Debian wheezy-backports)
>> * 2 instances with 1.4.18-0ubuntu1.2 (Ubuntu precise)
>>
>> However only the one Debian instance is having the problem.
>>
>> Note that haproxy 1.4.24-1~bpo70+1 should contain the latest CPU related
>> fixes discussed here.
>>
>> When I strace that instance, it's showing lots and lots of:
>>
>> epoll_wait(0, {}, 200, 0) = 0
>
> Is it the only output for epoll_wait() or do you also see some values
> where the timeout is non-zero (the last field) ? If it's always this,
> then it probably means that a task supposed to run does not correctly
> run or does not sleep.
It's always like this.
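For anyone who wants to check this on their own strace capture instead of eyeballing it, here is a minimal Python sketch. It assumes the lines look like the epoll_wait() line quoted above; the function name summarize_epoll is made up for this illustration and is not part of haproxy or strace.

```python
import re

# Matches strace lines such as: epoll_wait(0, {}, 200, 0) = 0
# Groups: epfd, events, maxevents, timeout, return value.
EPOLL_RE = re.compile(
    r"epoll_wait\((\d+),\s*(.*),\s*(\d+),\s*(-?\d+)\)\s*=\s*(-?\d+)"
)


def summarize_epoll(lines):
    """Count epoll_wait calls with a zero timeout and with a non-zero timeout."""
    zero = nonzero = 0
    for line in lines:
        match = EPOLL_RE.search(line)
        if match is None:
            continue  # not an epoll_wait line
        timeout = int(match.group(4))  # last argument of epoll_wait
        if timeout == 0:
            zero += 1
        else:
            nonzero += 1
    return zero, nonzero


if __name__ == "__main__":
    sample = [
        "epoll_wait(0, {}, 200, 0) = 0",
        "epoll_wait(0, {{EPOLLIN, {u32=5, u64=5}}}, 200, 1000) = 1",
    ]
    zero, nonzero = summarize_epoll(sample)
    print(f"zero timeout: {zero}, non-zero timeout: {nonzero}")
```

If the non-zero count stays at 0 over a long capture, that matches the "always like this" case described above.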
> Could you please try to run it with "nosepoll" in the global section ?
> It will disable the speculative epoll and use only epoll. Can you
> confirm that you don't have any health checks in your config ?
> Health-checks are a good example of a background task. I'm trying
> to understand what could happen...
>
> Other than that, I'm not thinking about anything that could induce
> this. If you can, please try to disable the transparent proxy. We
> could imagine that it reports an error very early that is not
> properly caught and that causes the process to loop until a connect
> timeout for example.
We had another, apparently unrelated problem. Once we fixed it, haproxy
returned to normal operation.
The problem we had is that we have most of our virtual servers on an
internal, private network (this can be seen in the config posted in my
original mail). This was working fine before we switched to TPROXY.
Once we switched certain ports to TPROXY mode, it was no longer possible
to connect (TCP) from an internal server to a TPROXYed service.
The problem was that communication from an internal server to the
service was going over the external IP and thus over haproxy.
Haproxy would then TPROXY the communication and thus preserve the
original sender IP.
Once the service got the IP packet, it would reply to the *internal* IP,
and that reply would not be routed via haproxy but sent directly.
So there would be one connection from the internal client machine to
haproxy and another from haproxy to the internal server, while replies
from the server would go directly to the client. The IP stack evidently
cannot handle that, since the client expects replies from the external
IP it connected to.
Is that where the 100% CPU and the epoll_wait() loop came from, with
haproxy tirelessly but unsuccessfully trying to establish a connection
to forward the TCP stream?
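To make the routing problem concrete, here is a small Python sketch of the check that decides whether a client is affected. It is only an illustration under assumptions: the function name reply_bypasses_proxy is invented here, and the list of networks the server can reach without going through haproxy has to come from your own routing setup. The example uses 192.168.138.0/24 from the config further down.

```python
import ipaddress


def reply_bypasses_proxy(client_ip, directly_routed_nets):
    """Return True if the server would answer this client directly.

    With TPROXY the server sees the client's real source address. If that
    address lies in a network the server reaches without passing through
    haproxy, the reply skips haproxy and the connection cannot complete.
    """
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr in ipaddress.ip_network(net) for net in directly_routed_nets
    )


if __name__ == "__main__":
    # Assumption: this is the LAN the server can reach directly.
    lan = ["192.168.138.0/24"]
    for client in ("192.168.138.20", "203.0.113.7"):
        affected = reply_bypasses_proxy(client, lan)
        print(client, "needs a non-TPROXY backend" if affected else "can use TPROXY")
```

Clients for which this returns True are the ones that should not be sent through a backend that preserves the source address.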
Our solution was to split the configuration so that connections from the
internal private networks and connections from external networks use
different backends:

    frontend smtp
        bind 4.3.2.1:25,4.3.2.1:465
        mode tcp
        maxconn 100
        acl client_from_lan src 192.168.138.0/24
        use_backend smtp_b0 if client_from_lan
        default_backend smtp_b1

    backend smtp_b0
        mode tcp
        server vilan-mail 192.168.1.1

    backend smtp_b1
        mode tcp
        source 0.0.0.0 usesrc clientip
        server vilan-mail 192.168.1.1

Now "everything's OK".
I hope this helps someone who runs into the same problem.
Maybe haproxy could also be made more resilient in such a situation, or
at least emit some helpful diagnostic messages.
Credits: the solution to the connection problem was found by
Michal Fiala.
Many thanks, Willy, for your helpfulness!!!
*t