Hi List, Thierry,

Actually this is not limited to restarts, and also happens with 1.9dev. It now happens while haproxy was running for a while and no restart was attempted while running/debugging in my NetBeans IDE..

Root cause imo is that hlua_socket_receive_yield and hlua_socket_release both try and acquire the same lock.


For debugging purposes ive added some code in hlua_socket_receive_yield(..) before the stream_int_notify:

    struct channel *ic2 = si_ic(si);
    struct channel *oc2 = si_oc(si);
    ha_warning("hlua_socket_receive_yield calling notify peer:%9x si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state, oc2->flags, ic2->flags);
    stream_int_notify(&s->si[0]);

And:
static void hlua_socket_release(struct appctx *appctx)
{
    struct xref *peer;
    if (appctx->ctx.hlua_cosocket.xref.peer > 1)
        ha_warning("hlua_socket_release peer: %9x %9x\n", appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
    else
        ha_warning("hlua_socket_release peer: %9x 0\n", appctx->ctx.hlua_cosocket.xref);


And also added code in xref_get_peer_and_lock(..):
static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
{
    if (xref->peer == 1) {
        printf("  xref_get_peer_and_lock xref->peer == 1 \n");
    }


This produces the logging:

[WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify peer:  2355590  si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
[WARNING] 127/001127 (36579) : hlua_socket_release peer:         1 0
  xref_get_peer_and_lock xref->peer == 1

When xref_get_peer_and_lock is called with a parameter xref->peer value of 1 then it looks like it keeps swapping 1 and 1 until it is not 1, that never happens..

As for the oc2.flags it contains the CF_SHUTW_NOW.. of which im still not 100% when that flag is exactly set to get a foolproof reproduction.. but it happens on pretty much a daily basis for me in production and in test i can now usually trigger it after a few testruns with no actual traffic passing along within the first minute of running (healthchecks are performed on several backend, and a mail or 2 is send by the lua code during this startup period..).. with the full production config..

Below the stacktrace that comes with it..

xref_get_peer_and_lock (xref=0x802355590) at P:\Git\haproxy\include\common\xref.h:37
hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
si_applet_release (si=0x8023514c8) at P:\Git\haproxy\include\proto\stream_interface.h:233 stream_int_shutw_applet (si=0x8023514c8) at P:\Git\haproxy\src\stream_interface.c:1504 si_shutw (si=0x8023514c8) at P:\Git\haproxy\include\proto\stream_interface.h:320 stream_int_notify (si=0x8023514c8) at P:\Git\haproxy\src\stream_interface.c:465 hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at P:\Git\haproxy\src\hlua.c:1789
?? () at null:
?? () at null:
lua_resume () at null:
hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at P:\Git\haproxy\src\hlua.c:1022
hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
main (argc=4, argv=0x7fffffffea80) at P:\Git\haproxy\src\haproxy.c:3053

I don't yet have a any idea about the direction of a possible fix.. :(..
Issue is that probably the hlua_socket_release should happen, but it doesnt know what socket / peer it should release at that point.. its in the local peer variable of the hlua_socket_receive_yield funtion.. Should it be 'unlocked' before calling stream_int_notify??

Does anyone dare to take a stab at a creating a patch ? If so thanks in advance ;)

Regards,
PiBa-NL (Pieter)


Op 3-5-2018 om 1:30 schreef PiBa-NL:
Hi List,

Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid> parameter) one of the processes seems to get into a 'hanging' state consuming 100% cpu..

In this configuration i'm using 'nbthread 1' not sure if this is related to the corrupted task-tree from my other lua issue.?. https://www.mail-archive.com/haproxy@formilux.org/msg29801.html .?.

Also i'm using my new smtpmailqueue and serverhealthchecker lua scripts (can be found on github.).. these probably 'contribute' to triggering the condition.

Anything i can check / provide.?

(cant really share the config itself a.t.m. as its from our production env, but it has like 15 backends with 1 server each, a little header rewriting/insertion but nothing big..)

GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/haproxy...done.
Attaching to program: /usr/local/sbin/haproxy, process 68580
Reading symbols from /lib/libcrypt.so.5...(no debugging symbols found)...done. Reading symbols from /lib/libz.so.6...(no debugging symbols found)...done. Reading symbols from /lib/libthr.so.3...(no debugging symbols found)...done. Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols found)...done. Reading symbols from /lib/libcrypto.so.8...(no debugging symbols found)...done. Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging symbols found)...done. Reading symbols from /lib/libm.so.5...(no debugging symbols found)...done. Reading symbols from /lib/libc.so.7...(no debugging symbols found)...done. Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols found)...done.
[Switching to LWP 100340 of process 68580]
0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:34
34      include/common/xref.h: No such file or directory.
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) next
45      in include/common/xref.h
(gdb) next
46      in include/common/xref.h
(gdb) next
34      in include/common/xref.h
(gdb) next
37      in include/common/xref.h
(gdb) bt full
#0  xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37
        local = 0x1
        remote = 0x7fffffffe210
#1  0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at src/hlua.c:1583
        peer = 0x0
#2  0x0000000000546519 in si_applet_release (si=0x802551dc8) at include/proto/stream_interface.h:233
        appctx = 0x80254b600
#3  0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at src/stream_interface.c:1502
        ic = 0x802551b90
        oc = 0x802551bd0
#4  0x0000000000543809 in si_shutw (si=0x802551dc8) at include/proto/stream_interface.h:320
No locals.
#5  0x0000000000543320 in stream_int_notify (si=0x802551dc8) at src/stream_interface.c:465
        conn = 0x0
        ic = 0x802551b90
        oc = 0x802551bd0
#6  0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8, status=1, ctx=0) at src/hlua.c:1770
        socket = 0x8026a0028
        wanted = -1
        hlua = 0x8024d6780
        appctx = 0x80254b600
        len = 25
        nblk = 1
        blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\" width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\" width=\"10%\"><a name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px href=\"#SMTP-Mailroute_XXXX-25_E"...
        len1 = 25
        blk2 = 0x100000005 <error: Cannot access memory at address 0x100000005>
        len2 = 3
        skip_at_end = 2
        oc = 0x802551bd0
        si = 0x802551dc8
        s = 0x802551b80
        peer = 0x80254b680
#7  0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
#8  0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#9  0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780, yield_allowed=1) at src/hlua.c:1014
        ret = 39132616
        msg = 0x802647f00 ""
#11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at src/hlua.c:5523
        hlua = 0x8024d6780
        status = HLUA_E_OK
#12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231
        t = 0x80242d460
        i = 32767
        max_processed = 199
        rq_next = 0x0
        local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x100000000, 0x0, 0x0, 0x0,
          0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758}
        local_tasks_count = 0
        final_tasks_count = 0
#13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399
        next = 588656154
        exp = 588656154
#14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at src/haproxy.c:2461
        start_lock = 0
        ptif = 0x8a0908 <per_thread_init_list>
        ptdf = 0x202
#15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at src/haproxy.c:3050
        tids = 0x802446068
        threads = 0x8024efd60
        i = 1
        err = 0
        retry = 200
        limit = {rlim_cur = 2285, rlim_max = 2285}
---Type <return> to continue, or q <return> to quit---
        errmsg = "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
        pidfd = 26
(gdb) info locals
tids = 0x802446068
threads = 0x8024efd60
i = 1
err = 0
retry = 200
limit = {rlim_cur = 2285, rlim_max = 2285}
errmsg = "\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
pidfd = 26


Reply via email to