On 08/28/2018 11:19 AM, Frederic Lecaille wrote:
On 08/27/2018 10:46 PM, PiBa-NL wrote:
Hi Frederic, Oliver,
Hi Pieter,
Thanks for your investigations :).
I've made a little reg-test (files attached). It's probably not
'correct' to commit as-is, but it should be enough to get a
reproduction... I hope.
Changing it to nbthread 1 makes it work every time (that I tried).
Your script is correct. Thanks a lot for this, Pieter.
The test actually seems to show a variety of issues.
## Every once in a while it takes like 7 seconds to run a test..
During which cpu usage is high..
Sounds like the first issue you reported. You can use the varnishtest -t
option to set a large timeout so that you have enough time to kill
varnishtest (Ctrl+C) before it kills haproxy. Then you can attach
gdb to the haproxy process.
---- c0 7.6 HTTP rx timeout (fd:5 7500 ms)
## But most of the time, it just doesn't finish with a correct result
(I've seen haproxy produce core dumps as well while testing..). There is
of course the possibility that I did something wrong in the Lua as well...
Does the test itself work for you guys? (with nbthread 1)
I have not managed to make this script fail with "nbthread 1".
I have also seen coredumps with "nbthread 3" even with only one HTTP
request from c0 client:
client c0 -connect ${h1_fe1_sock} {
txreq -url "/"
rxresp
expect resp.status == 200
}
If you run varnishtest with the -l option, it leaves the temporary vtc.*
directory in place when the test fails.
If you set up your environment to produce coredumps (ulimit -c unlimited),
you should find coredump files in the /tmp/vtc.*/<haproxy instance name>/
directory (/tmp/vtc.*/h1/ in our case).
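To sum up the steps above, a hypothetical debugging session could look like this (the test file name sess_reuse.vtc is an assumption, and the exact core file name depends on your kernel's core_pattern):

```shell
# allow the crashing haproxy process to write a coredump
ulimit -c unlimited

# -l keeps the temporary /tmp/vtc.* directory when the test fails,
# -t raises the per-test timeout so you can Ctrl+C varnishtest
# yourself and attach gdb to the still-running haproxy instead
varnishtest -l -t 60 sess_reuse.vtc

# after a crash, open the core left in the instance's directory
gdb /path/to/haproxy /tmp/vtc.*/h1/core* -ex 'bt full'
```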
According to gdb we have an issue in src/ssl_sock.c, so I have CC'd this
mail to Emeric:
Reading symbols from haproxy...done.
[New LWP 32432]
[New LWP 32431]
[New LWP 32428]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/flecaille/src/haproxy/haproxy -d -f
/tmp/vtc.32410.6f80f987/h1/cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f78f98bba56 in ASN1_get_object ()
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
[Current thread is 1 (Thread 0x7f78f8522700 (LWP 32432))]
(gdb) bt full
#0 0x00007f78f98bba56 in ASN1_get_object ()
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#1 0x00007f78f98c2ff8 in ?? () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#2 0x00007f78f98c41b5 in ?? () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#3 0x00007f78f98c4ead in ASN1_item_ex_d2i ()
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#4 0x00007f78f98c4f2b in ASN1_item_d2i ()
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#5 0x00007f78f9cdac98 in d2i_SSL_SESSION ()
from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
No symbol table info available.
#6 0x000055e2078be006 in ssl_sock_init (conn=0x7f78e8012220) at
src/ssl_sock.c:4985
ptr = 0xf800000000000000 <error: Cannot access memory at
address 0xf800000000000000>
sess = <optimized out>
may_retry = <optimized out>
conn = 0x7f78e8012220
#7 0x000055e20797dfc1 in conn_xprt_init (conn=0x7f78e8012220)
at include/proto/connection.h:84
ret = 0
#8 tcp_connect_server (conn=0x7f78e8012220, data=0, delack=<optimized
out>)
at src/proto_tcp.c:545
fd = 18
srv = <optimized out>
be = 0x55e207c567e0 <socket_proxy>
src = <optimized out>
#9 0x000055e207981aba in si_connect (si=0x7f78e8017680)
at include/proto/stream_interface.h:366
ret = 0
#10 connect_server (s=s@entry=0x7f78e80173b0) at src/backend.c:1223
cli_conn = 0x0
srv_conn = 0x7f78e8012220
srv_cs = <optimized out>
old_cs = <optimized out>
reuse = <optimized out>
err = <optimized out>
#11 0x000055e207924295 in sess_update_stream_int (s=0x7f78e80173b0) at
src/stream.c:885
conn_err = <optimized out>
si = 0x7f78e8017680
req = 0x7f78e80173c0
#12 process_stream (t=<optimized out>, context=0x7f78e80173b0,
state=<optimized out>)
at src/stream.c:2240
s = 0x7f78e80173b0
sess = <optimized out>
rqf_last = <optimized out>
rpf_last = 2147483648
rq_prod_last = <optimized out>
rq_cons_last = <optimized out>
rp_cons_last = <optimized out>
rp_prod_last = <optimized out>
req_ana_back = <optimized out>
req = 0x7f78e80173c0
res = 0x7f78e8017420
si_f = 0x7f78e8017638
si_b = 0x7f78e8017680
#13 0x000055e2079ab1f8 in process_runnable_tasks () at src/task.c:381
t = <optimized out>
state = <optimized out>
ctx = <optimized out>
---Type <return> to continue, or q <return> to quit---
process = <optimized out>
t = <optimized out>
max_processed = <optimized out>
#14 0x000055e207959c51 in run_poll_loop () at src/haproxy.c:2386
next = <optimized out>
exp = <optimized out>
#15 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2451
ptif = <optimized out>
ptdf = <optimized out>
start_lock = 0
#16 0x00007f78f9f27494 in start_thread (arg=0x7f78f8522700) at
pthread_create.c:333
__res = <optimized out>
pd = 0x7f78f8522700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140157538936576,
-2788191698922495525, 0, 140727734888799, 0,
140157575090240,
2864195697895809499, 2864194293716885979},
mask_was_saved = 0}},
priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0,
cleanup = 0x0,
canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#17 0x00007f78f9266acf in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:97
No locals
I am coming back to this issue first because it may lead to memory
corruption.
On my PC, if I add traces in ssl_sock_prepare_srv_ctx() when it is
called by hlua_init(), global.nbthread equals 1 (the default value
before the configuration files are parsed), when it should be 3.
Note that we allocate exactly global.nbthread pointers to struct
reused_sess in this function. When we enter ssl_sock_init() later, we
initialize ptr (above in the trace) like this:
const unsigned char *ptr =
objt_server(conn->target)->ssl_ctx.reused_sess[tid].ptr
When tid equals 0 or 1, haproxy does not crash. It crashes when tid
equals 2.
So, the bug here is that hlua_init() is called too soon, or at least
that ssl_sock_prepare_srv_ctx() is, for the Lua SSL server socket.
I hope this bug is related to the other ones.
I am going to open another thread to post a patch.
Regards.