Hi List, Thierry,
Actually this is not limited to restarts, and also happens with 1.9dev.
It now happens while haproxy was running for a while and no restart was
attempted while running/debugging in my NetBeans IDE..
Root cause imo is that hlua_socket_receive_yield and hlua_socket_release
both try and acquire the same lock.
For debugging purposes ive added some code in
hlua_socket_receive_yield(..) before the stream_int_notify:
struct channel *ic2 = si_ic(si);
struct channel *oc2 = si_oc(si);
ha_warning("hlua_socket_receive_yield calling notify peer:%9x
si[0].state:%d oc2.flag:%09x ic2.flag:%09x\n", peer, s->si[0].state,
oc2->flags, ic2->flags);
stream_int_notify(&s->si[0]);
And:
static void hlua_socket_release(struct appctx *appctx)
{
struct xref *peer;
if (appctx->ctx.hlua_cosocket.xref.peer > 1)
ha_warning("hlua_socket_release peer: %9x %9x\n",
appctx->ctx.hlua_cosocket.xref, appctx->ctx.hlua_cosocket.xref.peer->peer);
else
ha_warning("hlua_socket_release peer: %9x 0\n",
appctx->ctx.hlua_cosocket.xref);
And also added code in xref_get_peer_and_lock(..):
static inline struct xref *xref_get_peer_and_lock(struct xref *xref)
{
if (xref->peer == 1) {
printf(" xref_get_peer_and_lock xref->peer == 1 \n");
}
This produces the logging:
[WARNING] 127/001127 (36579) : hlua_socket_receive_yield calling notify
peer: 2355590 si[0].state:7 oc2.flag:0c000c220 ic2.flag:00084a024
[WARNING] 127/001127 (36579) : hlua_socket_release peer: 1 0
xref_get_peer_and_lock xref->peer == 1
When xref_get_peer_and_lock is called with a parameter xref->peer value
of 1 then it looks like it keeps swapping 1 and 1 until it is not 1,
that never happens..
As for the oc2.flags it contains the CF_SHUTW_NOW.. of which im still
not 100% when that flag is exactly set to get a foolproof reproduction..
but it happens on pretty much a daily basis for me in production and in
test i can now usually trigger it after a few testruns with no actual
traffic passing along within the first minute of running (healthchecks
are performed on several backend, and a mail or 2 is send by the lua
code during this startup period..).. with the full production config..
Below the stacktrace that comes with it..
xref_get_peer_and_lock (xref=0x802355590) at
P:\Git\haproxy\include\common\xref.h:37
hlua_socket_release (appctx=0x802355500) at P:\Git\haproxy\src\hlua.c:1595
si_applet_release (si=0x8023514c8) at
P:\Git\haproxy\include\proto\stream_interface.h:233
stream_int_shutw_applet (si=0x8023514c8) at
P:\Git\haproxy\src\stream_interface.c:1504
si_shutw (si=0x8023514c8) at
P:\Git\haproxy\include\proto\stream_interface.h:320
stream_int_notify (si=0x8023514c8) at
P:\Git\haproxy\src\stream_interface.c:465
hlua_socket_receive_yield (L=0x80223b388, status=1, ctx=0) at
P:\Git\haproxy\src\hlua.c:1789
?? () at null:
?? () at null:
lua_resume () at null:
hlua_ctx_resume (lua=0x8022cb800, yield_allowed=1) at
P:\Git\haproxy\src\hlua.c:1022
hlua_process_task (task=0x80222a500) at P:\Git\haproxy\src\hlua.c:5556
process_runnable_tasks () at P:\Git\haproxy\src\task.c:232
run_poll_loop () at P:\Git\haproxy\src\haproxy.c:2401
run_thread_poll_loop (data=0x802242080) at P:\Git\haproxy\src\haproxy.c:2463
main (argc=4, argv=0x7fffffffea80) at P:\Git\haproxy\src\haproxy.c:3053
I don't yet have a any idea about the direction of a possible fix.. :(..
Issue is that probably the hlua_socket_release should happen, but it
doesnt know what socket / peer it should release at that point.. its in
the local peer variable of the hlua_socket_receive_yield funtion..
Should it be 'unlocked' before calling stream_int_notify??
Does anyone dare to take a stab at a creating a patch ? If so thanks in
advance ;)
Regards,
PiBa-NL (Pieter)
Op 3-5-2018 om 1:30 schreef PiBa-NL:
Hi List,
Sometimes after a few 'restarts' of haproxy 1.8.8 (using -sf <pid>
parameter) one of the processes seems to get into a 'hanging' state
consuming 100% cpu..
In this configuration i'm using 'nbthread 1' not sure if this is
related to the corrupted task-tree from my other lua issue.?.
https://www.mail-archive.com/haproxy@formilux.org/msg29801.html .?.
Also i'm using my new smtpmailqueue and serverhealthchecker lua
scripts (can be found on github.).. these probably 'contribute' to
triggering the condition.
Anything i can check / provide.?
(cant really share the config itself a.t.m. as its from our production
env, but it has like 15 backends with 1 server each, a little header
rewriting/insertion but nothing big..)
GNU gdb (GDB) 8.0.1 [GDB v8.0.1 for FreeBSD]
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show
copying"
and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd11.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/sbin/haproxy...done.
Attaching to program: /usr/local/sbin/haproxy, process 68580
Reading symbols from /lib/libcrypt.so.5...(no debugging symbols
found)...done.
Reading symbols from /lib/libz.so.6...(no debugging symbols
found)...done.
Reading symbols from /lib/libthr.so.3...(no debugging symbols
found)...done.
Reading symbols from /usr/lib/libssl.so.8...(no debugging symbols
found)...done.
Reading symbols from /lib/libcrypto.so.8...(no debugging symbols
found)...done.
Reading symbols from /usr/local/lib/liblua-5.3.so...(no debugging
symbols found)...done.
Reading symbols from /lib/libm.so.5...(no debugging symbols
found)...done.
Reading symbols from /lib/libc.so.7...(no debugging symbols
found)...done.
Reading symbols from /libexec/ld-elf.so.1...(no debugging symbols
found)...done.
[Switching to LWP 100340 of process 68580]
0x000000000044890b in xref_get_peer_and_lock (xref=0x80254b680) at
include/common/xref.h:34
34 include/common/xref.h: No such file or directory.
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) next
45 in include/common/xref.h
(gdb) next
46 in include/common/xref.h
(gdb) next
34 in include/common/xref.h
(gdb) next
37 in include/common/xref.h
(gdb) bt full
#0 xref_get_peer_and_lock (xref=0x80254b680) at include/common/xref.h:37
local = 0x1
remote = 0x7fffffffe210
#1 0x000000000044833d in hlua_socket_release (appctx=0x80254b600) at
src/hlua.c:1583
peer = 0x0
#2 0x0000000000546519 in si_applet_release (si=0x802551dc8) at
include/proto/stream_interface.h:233
appctx = 0x80254b600
#3 0x0000000000541a8d in stream_int_shutw_applet (si=0x802551dc8) at
src/stream_interface.c:1502
ic = 0x802551b90
oc = 0x802551bd0
#4 0x0000000000543809 in si_shutw (si=0x802551dc8) at
include/proto/stream_interface.h:320
No locals.
#5 0x0000000000543320 in stream_int_notify (si=0x802551dc8) at
src/stream_interface.c:465
conn = 0x0
ic = 0x802551b90
oc = 0x802551bd0
#6 0x000000000044dc81 in hlua_socket_receive_yield (L=0x80243f2a8,
status=1, ctx=0) at src/hlua.c:1770
socket = 0x8026a0028
wanted = -1
hlua = 0x8024d6780
appctx = 0x80254b600
len = 25
nblk = 1
blk1 = 0x8027da4d4 "501 5.1.3 Invalid address\r\nclass=\"tbl\"
width=\"100%\">\n<tr class=\"titre\"><th class=\"pxname\"
width=\"10%\"><a
name=\"SMTP-Mailroute_XXXX-25_EX01-125_tcp_ipv4\"></a><a class=px
href=\"#SMTP-Mailroute_XXXX-25_E"...
len1 = 25
blk2 = 0x100000005 <error: Cannot access memory at address
0x100000005>
len2 = 3
skip_at_end = 2
oc = 0x802551bd0
si = 0x802551dc8
s = 0x802551b80
peer = 0x80254b680
#7 0x0000000801880417 in ?? () from /usr/local/lib/liblua-5.3.so
---Type <return> to continue, or q <return> to quit---
No symbol table info available.
#8 0x000000080187f1a6 in ?? () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#9 0x0000000801880194 in lua_resume () from /usr/local/lib/liblua-5.3.so
No symbol table info available.
#10 0x00000000004370a9 in hlua_ctx_resume (lua=0x8024d6780,
yield_allowed=1) at src/hlua.c:1014
ret = 39132616
msg = 0x802647f00 ""
#11 0x00000000004406c7 in hlua_process_task (task=0x80242d460) at
src/hlua.c:5523
hlua = 0x8024d6780
status = HLUA_E_OK
#12 0x0000000000591177 in process_runnable_tasks () at src/task.c:231
t = 0x80242d460
i = 32767
max_processed = 199
rq_next = 0x0
local_tasks = {0x0, 0x8add10 <applet_active_lock>, 0x0,
0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600, 0x80254b600,
0x100000000, 0x0, 0x0, 0x0,
0x8add10 <applet_active_lock>, 0x0, 0x0, 0x7fffffffe758}
local_tasks_count = 0
final_tasks_count = 0
#13 0x0000000000510a5b in run_poll_loop () at src/haproxy.c:2399
next = 588656154
exp = 588656154
#14 0x000000000050e370 in run_thread_poll_loop (data=0x802446068) at
src/haproxy.c:2461
start_lock = 0
ptif = 0x8a0908 <per_thread_init_list>
ptdf = 0x202
#15 0x000000000050aef2 in main (argc=10, argv=0x7fffffffec58) at
src/haproxy.c:3050
tids = 0x802446068
threads = 0x8024efd60
i = 1
err = 0
retry = 200
limit = {rlim_cur = 2285, rlim_max = 2285}
---Type <return> to continue, or q <return> to quit---
errmsg =
"\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
pidfd = 26
(gdb) info locals
tids = 0x802446068
threads = 0x8024efd60
i = 1
err = 0
retry = 200
limit = {rlim_cur = 2285, rlim_max = 2285}
errmsg =
"\000\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000R\325c\321\223Z\332\017\240u\212\000\000\000\000\000P\354\377\377\377\177\000\000\260\354\377\377\377\177\000\000X\354\377\377\377\177\000\000\n\000\000\000\000\000\000\000\360\353\377\377\377\177\000\000r3\340\001\b\000\000\000\001\000\000"
pidfd = 26