On 2017/5/29 16:04, Frederic Lecaille wrote:
> On 05/29/2017 06:12 PM, Patrick Hemmer wrote:
>>
>> On 2017/5/29 08:22, Frederic Lecaille wrote:
>>>
>>> Hi Patrick,
>>>
>>> First thank you for this nice and helpful report.
>>>
>>> Would it be possible to have an output of this command the next time
>>> you reproduce such an issue please?
>>>
>>>     echo "show sess" | socat stdio <haproxy stats socket path>
>>
>> Unfortunately this would not be possible. When the issue occurs, the
>> haproxy process has stopped accepting connections on all sockets. If I
>> were to run this command, it would be sent to the new process, not the
>> one that won't shut down.
>
>
> If you send a SIGHUP to haproxy-systemd-wrapper, it asks the old
> process to stop gracefully.
Yes, that is what my issue report is about. When sent a SIGHUP, the new
process comes up, but the old process won't shut down.
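For what it's worth, the stuck old process can also be signalled directly,
bypassing the wrapper (28856 being the pid of the old process from the
lsof/strace output below); per the management doc linked below, SIGUSR1
requests a soft stop and SIGTERM stops it immediately:

    # ask the stuck old process for another soft stop
    kill -USR1 28856
    # last resort: immediate hard stop
    kill -TERM 28856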

>
> Please have a look to this documentation:
>
> https://cbonte.github.io/haproxy-dconv/1.7/management.html#4
>
> So you are right: if everything goes well, no more connections are
> accept()'ed by the old process (its sockets have been unbound). But in
> your reported case the peer sockets are not closed because they are
> still in CLOSE_WAIT state, so they are still being processed, and stats
> information is still available from the stats socket.
The information might still be tracked within the process, but there is
no way to query the information because the process is no longer
accepting new connections. The new process has taken over control of the
admin socket.
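About the only way to inspect the old process at that point is from the
outside, for example (28856 being the pid of the stuck process, as in the
lsof output quoted below):

    # file descriptors still held by the old process
    lsof -p 28856
    # TCP connections stuck in CLOSE_WAIT, with the owning process
    ss -tnp state close-wait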

>
> If I have missed something, please do not hesitate to yell at me ;) .
>
> I have been told that "show sess *all*" gives more information.
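> With the same stats socket path placeholder as in the command above,
> that should be something like:
>
>     echo "show sess all" | socat stdio <haproxy stats socket path>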
>
>>>
>>> I have only one question (see below).
>>>
>>> On 05/24/2017 10:40 AM, Willy Tarreau wrote:
>>>> Hi Patrick,
>>>>
>>>> On Tue, May 23, 2017 at 01:49:42PM -0400, Patrick Hemmer wrote:
>>>> (...)
>>>>> haproxy 28856 root    1u     IPv4          420797940      0t0
>>>>> TCP 10.0.33.145:35754->10.0.33.147:1029 (CLOSE_WAIT)
>>>>> haproxy 28856 root    2u     IPv4          420266351      0t0
>>>>> TCP 10.0.33.145:52898->10.0.33.147:1029 (CLOSE_WAIT)
>>>>> haproxy 28856 root    3r      REG                0,3        0
>>>>> 4026531956 net
>>>>> haproxy 28856 root    4u     IPv4          422150834      0t0
>>>>> TCP 10.0.33.145:38874->10.0.33.147:1029 (CLOSE_WAIT)
>>>>
>>>> These ones are very interesting.
>>>
>>> These traces also seem interesting to me.
>>>
>>> # strace -p 28856
>>> Process 28856 attached
>>> epoll_wait(0, {}, 200, 319)             = 0
>>> epoll_wait(0, {}, 200, 0)               = 0
>>> epoll_wait(0, {}, 200, 362)             = 0
>>> epoll_wait(0, {}, 200, 0)               = 0
>>> epoll_wait(0, {}, 200, 114)             = 0
>>> epoll_wait(0, {}, 200, 0)               = 0
>>> epoll_wait(0, {}, 200, 203)             = 0
>>> epoll_wait(0, {}, 200, 0)               = 0
>>> epoll_wait(0, {}, 200, 331)             = 0
>>> epoll_wait(0, {}, 200, 0)
>>>
>>>
>>> Were such "epoll_wait(0, {}, 200, 0)" calls displayed indefinitely?
>> Yes
>>
>>>
>>>
>>> In fact I am wondering if it is normal to have so many epoll_wait(0,
>>> {}, 200, 0) calls for a haproxy process which has shut down.
>>>
>>> I suspect they are related to peer tasks (which have obviously
>>> expired).
>>>
>>> If this is the case, then with configurations that have only peer tasks,
>>> haproxy would definitely hang, consuming a lot of CPU resources.
>> HAProxy was not consuming high CPU. Note that in every other call to
>> `epoll_wait`, the 4th argument (the timeout) was >0. If every single
>> timeout value were 0, then yes, it would spin consuming CPU.
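>> Should this happen again, a quick way to tell an idle process from a
>> spinning one is to count the calls for a few seconds, for example:
>>
>>     strace -c -e trace=epoll_wait -p <pid>    # interrupt with Ctrl-C
>>
>> A busy loop shows an enormous call count with near-zero time per call,
>> while an idle process only makes a handful of calls.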
>>
>
> Agreed... but perhaps your configuration does not use only peer tasks,
> contrary to my configuration... it is your traces which led me to
> check how peer task expiration is handled in configurations with
> only peers as backends.
>
> In my case, with only two peers, I see the following traces after a
> peer has sent a synchronization request:
>
>     epoll_wait(0, {}, 200, 1000)
>     epoll_wait(0, {}, 200, 1000)
>     epoll_wait(0, {}, 200, 1000)
>     epoll_wait(0, {}, 200, 1000)
>     epoll_wait(0, {}, 200, X)    # with X < 1000
>
> followed by a big loop of
>
>     epoll_wait(0, {}, 200, 0)    # so consuming high CPU resources
> during a fraction of a second
>
> then:
>
>     shutdown(SHUT_WR)            # FIN TCP segment at about 5s after
> the first epoll_wait(0, {}, 200, 1000) above.
>
> then again:
>
>     epoll_wait(0, {}, 200, 0)
>
> until the remote peer, which is in CLOSE_WAIT state, shuts down its
> socket.
This doesn't make sense. CLOSE_WAIT means that the remote side has
already closed the socket, and the application on the local side needs
to issue a close() on it.
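For what it's worth, the TCP state of the peer connections can be checked
from both hosts with something like the following (1029 being the peer
port from the lsof output above; adjust to your peers section):

    ss -tnop '( sport = :1029 or dport = :1029 )'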

>
> I forgot to mention that a synchronization request is the first thing
> a peer launches: the peers of the new process try to synchronize with
> the peers of the old process.
>
> With the fix I provided, the process calls epoll_wait(0, {}, 200, *1000*)
> after having shutdown(SHUT_WR) its socket... but this concerns only peers
> which are remote from each other.
>
>
>>>
>>> So, I had a look at the code handling the 'expire' member of the peer
>>> task struct, and I have just found a situation where the pollers in
>>> relation with peer tasks are often called with an already-expired
>>> timeout, leading haproxy to consume a lot of CPU resources. In fact
>>> this happens each time the peer task has expired, during a fraction
>>> of a second.
>>>
>>> It is easy to reproduce this issue with a sort of peer simulator ;):
>>>
>>> strace -ttf socat TCP4-LISTEN:<peer port>,reuseaddr,fork SYSTEM:"echo
>>> 200;sleep 10"
>>>
>>> This peer must be started *before* the remote haproxy process whose
>>> only backends are peers.
>>>
>>> strace is here only to get an idea of the moment when the remote
>>> haproxy peer connects.
>>>
>>> The sleep command is here to give us enough time to block (Ctrl+S) the
>>> peer simulator process right after the haproxy peer has connected.
>>>
>>> So this peer accepts any remote peer session and sends back "200"
>>> status messages (and that's all).
>>>
>>> A haproxy peer which connects to such a peer (which never replies to a
>>> synchronization request) will endlessly consume high CPU resources
>>> until you unblock (Ctrl+Q) the peer simulator process.
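>>> In short, a minimal reproduction could look like this (1029 standing in
>>> for whatever peer port your configuration uses):
>>>
>>>     # 1. start the fake peer first
>>>     strace -ttf socat TCP4-LISTEN:1029,reuseaddr,fork SYSTEM:"echo 200;sleep 10"
>>>     # 2. start the haproxy process whose only backends are peers
>>>     # 3. once it has connected, freeze the simulator with Ctrl+S and watch
>>>     #    haproxy spin on epoll_wait(0, {}, 200, 0) under strace
>>>     # 4. Ctrl+Q unfreezes the simulator and ends the busy loop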
>>>
>>> *Unfortunately, I do not see any relation between this bug and the
>>> "CLOSE_WAIT peer state issue" which prevents haproxy from correctly
>>> shutting down.*
>>>
>>> I have attached a patch to this mail which fixes this issue.
>> Again, we're not seeing high CPU usage in this specific case. We have
>> reported a completely different scenario where haproxy starts consuming
>> CPU doing `epoll_wait(x,x,x,0)`, but this is not that. Every time this
>> shutdown issue occurs, the process is not consuming CPU.
>> However, it is possible that the two issues have the same root cause. I
>> will try out the patch and see what happens.
>
> Agreed again. This is why I regretfully mentioned that this bug has
> nothing to do with your reported case. The root cause of your haproxy
> process not shutting down is its peer sockets being stuck in the
> CLOSE_WAIT state.
>
> Still investigating ;)
>
> Regards.
>
