> On 9 May 2018, at 18:30, Thierry Fournier <thierry.fourn...@arpalert.org> 
> wrote:
> 
> 
> 
>> On 8 May 2018, at 00:33, PiBa-NL <piba.nl....@gmail.com> wrote:
>> 
>> Hi List, Thierry,
>> 
>> Actually this is not limited to restarts, and also happens with 1.9dev. It 
>> now happens while haproxy was running for a while and no restart was 
>> attempted while running/debugging in my NetBeans IDE..
>> 
>> Root cause imo is that hlua_socket_receive_yield and hlua_socket_release 
>> both try and acquire the same lock.
>> 
>> 
>> For debugging purposes ive added some code in hlua_socket_receive_yield(..) 
>> before the stream_int_notify:
>> 
>>    struct channel *ic2 = si_ic(si);
>>    struct channel *oc2 = si_oc(si);
>>    ha_warning("hlua_socket_receive_yield calling notify peer:%9x  
>> si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, 
>> oc2->flags, ic2->flags);
>>    stream_int_notify(&s->si[0]);
>> 
>> And:
>> static void hlua_socket_release(struct appctx *appctx)
>> {
>>    struct xref *peer;
>>    if (appctx->ctx.hlua_cosocket.xref.peer > 1)
>>        ha_warning("hlua_socket_release peer: %9x %9x\n", 
>> appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
>>    else
>>        ha_warning("hlua_socket_release peer: %9x 0\n", 
>> appctx->ctx.hlua_cosocket.xref);
>> 
>> 
>> And also added code in xref_get_peer_and_lock(..):
>> static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
>> {
>>    if (xref->peer == 1) {
>>        printf("  xref_get_peer_and_lock xref->peer == 1 \n");
>>    }
>> 
>> 
>> This produces the logging:
>> 
>> [WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify 
>> peer:  2355590  si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
>> [WARNING] 127/001127 (36579) : hlua_socket_release peer:         1 0
>>  xref_get_peer_and_lock xref->peer == 1 
>> 
>> When xref_get_peer_and_lock is called with a parameter xref->peer value of 1 
>> then it looks like it keeps swapping 1 and 1 until it is not 1, that never 
>> happens..
>> 
>> As for the oc2.flags it contains the CF_SHUTW_NOW.. of which im still not 
>> 100% when that flag is exactly set to get a foolproof reproduction.. but it 
>> happens on pretty much a daily basis for me in production and in test i can 
>> now usually trigger it after a few testruns with no actual traffic passing 
>> along within the first minute of running (healthchecks are performed on 
>> several backend, and a mail or 2 is send by the lua code during this startup 
>> period..).. with the full production config..
>> 
>> Below the stacktrace that comes with it..
>> 
>> xref_get_peer_and_lock (xref=0x802355590) at 
>> P:\Git\haproxy\include\common\xref.h:37
>> hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
>> si_applet_release (si=0x8023514c8) at 
>> P:\Git\haproxy\include\proto\stream_interface.h:233
>> stream_int_shutw_applet (si=0x8023514c8) at 
>> P:\Git\haproxy\src\stream_interface.c:1504
>> si_shutw (si=0x8023514c8) at 
>> P:\Git\haproxy\include\proto\stream_interface.h:320
>> stream_int_notify (si=0x8023514c8) at 
>> P:\Git\haproxy\src\stream_interface.c:465
>> hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at 
>> P:\Git\haproxy\src\hlua.c:1789
>> ?? () at null:
>> ?? () at null:
>> lua_resume () at null:
>> hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at 
>> P:\Git\haproxy\src\hlua.c:1022
>> hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
>> process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
>> run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
>> run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
>> main (argc=4, argv=0x7fffffffea80) at P:\Git\haproxy\src\haproxy.c:3053
>> 
>> I don't yet have a any idea about the direction of a possible fix.. :(..
>> Issue is that probably the hlua_socket_release should happen, but it doesnt 
>> know what socket / peer it should release at that point.. its in the local 
>> peer variable of the hlua_socket_receive_yield funtion.. Should it be 
>> 'unlocked' before calling stream_int_notify?
> 
> 
> This part of the code is tricky.
> 
> Just a precision:
> 
> - I write the “socket” session an haproxy classic session which is created by
>   Lua code for performing network I/O.
> 
> - I write “Lua” session for a classic HAProxy session which executes “Lua”, 
> and
>   this “Lua” code communicates with the “socket” session.
> 
> This lock have two goal: maintain the link between the Lua session and the 
> “socket”
> session in a coherent state and ensure the access of the “socket” buffer from 
> the
> Lua session. When the lock is acquired, the “socket” session can’t be cleaned.
> 
> I resume:
> 
> - “Lua” session is executed
> 
> - hlua_socket_receive_yield() acquire a lock
> 
> - hlua_socket_receive_yield() call stream_int_notify()
>   NOTE: the function hlua_socket_receive_yield() can’t close the “socket” 
> session
> 
> - stream_int_notify() is executed from “Lua” session.
> 
> - stream_int_notify() try to close the “socket” session
> 
> - stream_int_notify() try to acquire the lock which is already acquired by the
>   caller function hlua_socket_receive_yield().
> 
> It seems a dead lock, but you observe a loop.
> 
> The end of the function “hlua_socket_receive_yield()” seems suspect. You’re 
> right:
> the lock can be released before the stream_int_notify(). Unfortunately, the
> stream_int_notify() is applied to the variable “s” which is a “socket” 
> session. And
> this session can be cleared between the unlock and the stream_int_notify().
> 
> A first idea is moving the stream_int_notify() from direct function to a 
> task. It is
> not a simple job, and this need more brainstorm.


PS: It is very complicated to reproduce the problem. I suppose that it appens
when the remote server close the connection just after the begin of the 
execution
of the Lua read() function. If the connection is closed before the Lua 
execution,
it is detected by HAProxy, and process is clean.

Maybe the problem can be reproduced adding a sleep() function in the c code of
the hlua_socket_receive_yield() function. Like this:

   __LJMP static int hlua_socket_receive_yield(struct lua_State *L, int status, 
lua_KContext ctx)
   {
      // [...] function variables

      fprintf(stderr, “Now, I sleep 5S, close the server connection in order to 
reproduce the bug.\n”);
      sleep(5);

      // [...] function main part
   }

I try to reproduce it later.
 
Thierry


> 
> BR,
> Thierry
> 
> 
> 
>> Does anyone dare to take a stab at a creating a patch ? If so thanks in 
>> advance ;)
>> 
>> Regards,
>> PiBa-NL (Pieter)
>> 
>> 
>> Op 3-5-2018 om 1:30 schreef PiBa-NL:
>>> Hi List, 
>>> 
>>> Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid> 
>>> parameter) one of the processes seems to get into a 'hanging' state 
>>> consuming 100% cpu.. 
>>> 
>>> In this configuration i'm using 'nbthread 1' not sure if this is related to 
>>> the corrupted task-tree from my other lua issue.?. 
>>> https://www.mail-archive.com/haproxy@formilux.org/msg29801.html .?. 
>>> 
>>> Also i'm using my new smtpmailqueue and serverhealthchecker lua scripts 
>>> (can be found on github.).. these probably 'contribute' to triggering the 
>>> condition. 
>>> 
>>> Anything i can check / provide.? 
>>> 
>>> (cant really share the config itself a.t.m. as its from our production env, 
>>> but it has like 15 backends with 1 server each, a little header 
>>> rewriting/insertion but nothing big..) 
>>> 
>>> GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD] 
>>> Copyright (C) 2017 Free Software Foundation, Inc. 
>>> License GPLv3+: GNU GPL version 3 or later 
>>> <http://gnu.org/licenses/gpl.html> 
>>> This is free software: you are free to change and redistribute it. 
>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying" 
>>> and "show warranty" for details. 
>>> This GDB was configured as "x86_64-portbld-freebsd11.1". 
>>> Type "show configuration" for configuration details. 
>>> For bug reporting instructions, please see: 
>>> <http://www.gnu.org/software/gdb/bugs/>. 
>>> Find the GDB manual and other documentation resources online at: 
>>> <http://www.gnu.org/software/gdb/documentation/>. 
>>> For help, type "help". 
>>> Type "apropos word" to search for commands related to "word"... 
>>> Reading symbols from /usr/local/sbin/haproxy...done. 
>>> Attaching to program: /usr/local/sbin/haproxy, process 68580 
>>> Reading symbols from /lib/libcrypt.so.5...(no debugging symbols 
>>> found)...done. 
>>> Reading symbols from /lib/libz.so.6...(no debugging symbols found)...done. 
>>> Reading symbols from /lib/libthr.so.3...(no debugging symbols 
>>> found)...done. 
>>> Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols 
>>> found)...done. 
>>> Reading symbols from /lib/libcrypto.so.8...(no debugging symbols 
>>> found)...done. 
>>> Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging symbols 
>>> found)...done. 
>>> Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done. 
>>> Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done. 
>>> Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols 
>>> found)...done. 
>>> [Switching to LWP 100340 of process 68580] 
>>> 0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at 
>>> include/common/xref.h:34 
>>> 34      include/common/xref.h: No such file or directory. 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) next 
>>> 45      in include/common/xref.h 
>>> (gdb) next 
>>> 46      in include/common/xref.h 
>>> (gdb) next 
>>> 34      in include/common/xref.h 
>>> (gdb) next 
>>> 37      in include/common/xref.h 
>>> (gdb) bt full 
>>> #0  xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37 
>>>        local = 0x1 
>>>        remote = 0x7fffffffe210 
>>> #1  0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at 
>>> src/hlua.c:1583 
>>>        peer = 0x0 
>>> #2  0x0000000000546519 in si_applet_release (si=0x802551dc8) at 
>>> include/proto/stream_interface.h:233 
>>>        appctx = 0x80254b600 
>>> #3  0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at 
>>> src/stream_interface.c:1502 
>>>        ic = 0x802551b90 
>>>        oc = 0x802551bd0 
>>> #4  0x0000000000543809 in si_shutw (si=0x802551dc8) at 
>>> include/proto/stream_interface.h:320 
>>> No locals. 
>>> #5  0x0000000000543320 in stream_int_notify (si=0x802551dc8) at 
>>> src/stream_interface.c:465 
>>>        conn = 0x0 
>>>        ic = 0x802551b90 
>>>        oc = 0x802551bd0 
>>> #6  0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8, 
>>> status=1, ctx=0) at src/hlua.c:1770 
>>>        socket = 0x8026a0028 
>>>        wanted = -1 
>>>        hlua = 0x8024d6780 
>>>        appctx = 0x80254b600 
>>>        len = 25 
>>>        nblk = 1 
>>>        blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\" 
>>> width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\" width=\"10%\"><a 
>>> name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px 
>>> href=\"#SMTP-Mailroute_XXXX-25_E"... 
>>>        len1 = 25 
>>>        blk2 = 0x100000005 <error: Cannot access memory at address 
>>> 0x100000005> 
>>>        len2 = 3 
>>>        skip_at_end = 2 
>>>        oc = 0x802551bd0 
>>>        si = 0x802551dc8 
>>>        s = 0x802551b80 
>>>        peer = 0x80254b680 
>>> #7  0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so 
>>> ---Type <return> to continue, or q <return> to quit--- 
>>> No symbol table info available. 
>>> #8  0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so 
>>> No symbol table info available. 
>>> #9  0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so 
>>> No symbol table info available. 
>>> #10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780, 
>>> yield_allowed=1) at src/hlua.c:1014 
>>>        ret = 39132616 
>>>        msg = 0x802647f00 "" 
>>> #11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at 
>>> src/hlua.c:5523 
>>>        hlua = 0x8024d6780 
>>>        status = HLUA_E_OK 
>>> #12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231 
>>>        t = 0x80242d460 
>>>        i = 32767 
>>>        max_processed = 199 
>>>        rq_next = 0x0 
>>>        local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0, 0x80254b600, 
>>> 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x100000000, 0x0, 0x0, 
>>> 0x0, 
>>>          0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758} 
>>>        local_tasks_count = 0 
>>>        final_tasks_count = 0 
>>> #13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399 
>>>        next = 588656154 
>>>        exp = 588656154 
>>> #14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at 
>>> src/haproxy.c:2461 
>>>        start_lock = 0 
>>>        ptif = 0x8a0908 <per_thread_init_list> 
>>>        ptdf = 0x202 
>>> #15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at 
>>> src/haproxy.c:3050 
>>>        tids = 0x802446068 
>>>        threads = 0x8024efd60 
>>>        i = 1 
>>>        err = 0 
>>>        retry = 200 
>>>        limit = {rlim_cur = 2285, rlim_max = 2285} 
>>> ---Type <return> to continue, or q <return> to quit--- 
>>>        errmsg = 
>>> "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>>>        pidfd = 26 
>>> (gdb) info locals 
>>> tids = 0x802446068 
>>> threads = 0x8024efd60 
>>> i = 1 
>>> err = 0 
>>> retry = 200 
>>> limit = {rlim_cur = 2285, rlim_max = 2285} 
>>> errmsg = 
>>> "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
>>> pidfd = 26 


Reply via email to