On 08/28/2018 11:19 AM, Frederic Lecaille wrote:
On 08/27/2018 10:46 PM, PiBa-NL wrote:
Hi Frederic, Oliver,
Hi Pieter,
Thanks for your investigations :).
I've made a little reg-test (files attached). It's probably not
'correct' to commit as-is, but it should be enough to get a
reproduction... I hope.
Changing it to nbthread 1 makes it work every time (that I tried).
Your script is correct. Thanks a lot for this, Pieter.
The test actually seems to show a variety of issues.
## Every once in a while it takes like 7 seconds to run a test..
During which cpu usage is high..
Sounds like the first issue you reported. You can use the varnishtest -t
option to set a large timeout so that you have enough time to kill
varnishtest (Ctrl+C) before it kills haproxy. Then you can attach
gdb to the haproxy process.
---- c0 7.6 HTTP rx timeout (fd:5 7500 ms)
## But most of the time, it just doesn't finish with a correct result
(I've seen haproxy produce core dumps as well while testing..). There is
of course the possibility that I did something wrong in the Lua as well...
Does the test itself work for you guys? (with nbthread 1)
I have not managed to make this script fail with "nbthread 1".
I have also seen coredumps with "nbthread 3" even with only one HTTP
request from c0 client:
client c0 -connect ${h1_fe1_sock} {
txreq -url "/"
rxresp
expect resp.status == 200
}
If you run varnishtest with the -l option, it leaves the temporary vtc.*
directory in place when the test fails.
If you set up your environment to produce coredumps (ulimit -c unlimited),
you should find coredump files in the /tmp/vtc.*/<haproxy instance name>/
directory (/tmp/vtc.*/h1/ in our case).
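To sum up the steps above, a hypothetical debugging session could look like this (the test file name sess_reuse.vtc is an assumption, and the exact core file name depends on your kernel's core_pattern):

```shell
# allow the crashing haproxy process to write a coredump
ulimit -c unlimited

# -l keeps the temporary /tmp/vtc.* directory when the test fails,
# -t raises the per-test timeout so you can Ctrl+C varnishtest
# yourself and attach gdb to the still-running haproxy instead
varnishtest -l -t 60 sess_reuse.vtc

# after a crash, open the core left in the instance's directory
gdb /path/to/haproxy /tmp/vtc.*/h1/core* -ex 'bt full'
```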
According to gdb we have an issue in src/ssl_sock.c, so I have CC'd this
mail to Emeric:
Reading symbols from haproxy...done.
[New LWP 32432]
[New LWP 32431]
[New LWP 32428]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/flecaille/src/haproxy/haproxy -d -f
/tmp/vtc.32410.6f80f987/h1/cfg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f78f98bba56 in ASN1_get_object ()
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
[Current thread is 1 (Thread 0x7f78f8522700 (LWP 32432))]
(gdb) bt full
#0 0x00007f78f98bba56 in ASN1_get_object ()
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#1 0x00007f78f98c2ff8 in ?? () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#2 0x00007f78f98c41b5 in ?? () from
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#3 0x00007f78f98c4ead in ASN1_item_ex_d2i ()
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#4 0x00007f78f98c4f2b in ASN1_item_d2i ()
from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
No symbol table info available.
#5 0x00007f78f9cdac98 in d2i_SSL_SESSION ()
from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
No symbol table info available.
#6 0x000055e2078be006 in ssl_sock_init (conn=0x7f78e8012220) at
src/ssl_sock.c:4985
ptr = 0xf800000000000000 <error: Cannot access memory at
address 0xf800000000000000>
sess = <optimized out>
may_retry = <optimized out>
conn = 0x7f78e8012220
#7 0x000055e20797dfc1 in conn_xprt_init (conn=0x7f78e8012220)
at include/proto/connection.h:84
ret = 0
#8 tcp_connect_server (conn=0x7f78e8012220, data=0, delack=<optimized
out>)
at src/proto_tcp.c:545
fd = 18
srv = <optimized out>
be = 0x55e207c567e0 <socket_proxy>
src = <optimized out>
#9 0x000055e207981aba in si_connect (si=0x7f78e8017680)
at include/proto/stream_interface.h:366
ret = 0
#10 connect_server (s=s@entry=0x7f78e80173b0) at src/backend.c:1223
cli_conn = 0x0
srv_conn = 0x7f78e8012220
srv_cs = <optimized out>
old_cs = <optimized out>
reuse = <optimized out>
err = <optimized out>
#11 0x000055e207924295 in sess_update_stream_int (s=0x7f78e80173b0) at
src/stream.c:885
conn_err = <optimized out>
si = 0x7f78e8017680
req = 0x7f78e80173c0
#12 process_stream (t=<optimized out>, context=0x7f78e80173b0,
state=<optimized out>)
at src/stream.c:2240
s = 0x7f78e80173b0
sess = <optimized out>
rqf_last = <optimized out>
rpf_last = 2147483648
rq_prod_last = <optimized out>
rq_cons_last = <optimized out>
rp_cons_last = <optimized out>
rp_prod_last = <optimized out>
req_ana_back = <optimized out>
req = 0x7f78e80173c0
res = 0x7f78e8017420
si_f = 0x7f78e8017638
si_b = 0x7f78e8017680
#13 0x000055e2079ab1f8 in process_runnable_tasks () at src/task.c:381
t = <optimized out>
state = <optimized out>
ctx = <optimized out>
---Type <return> to continue, or q <return> to quit---
process = <optimized out>
t = <optimized out>
max_processed = <optimized out>
#14 0x000055e207959c51 in run_poll_loop () at src/haproxy.c:2386
next = <optimized out>
exp = <optimized out>
#15 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2451
ptif = <optimized out>
ptdf = <optimized out>
start_lock = 0
#16 0x00007f78f9f27494 in start_thread (arg=0x7f78f8522700) at
pthread_create.c:333
__res = <optimized out>
pd = 0x7f78f8522700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140157538936576,
-2788191698922495525, 0, 140727734888799, 0,
140157575090240,
2864195697895809499, 2864194293716885979},
mask_was_saved = 0}},
priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0,
cleanup = 0x0,
canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#17 0x00007f78f9266acf in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:97
No locals
I am coming back to this issue first because it may lead to memory
corruption.
On my PC, if I add traces in ssl_sock_prepare_srv_ctx() when it is
called by hlua_init(), global.nbthread equals 1 (the default value
before the configuration files are parsed), when it should be 3.
Note that we allocate exactly global.nbthread pointers to struct
reused_sess in this function. When we enter ssl_sock_init() later, we
initialize ptr (above in the trace) like this:
const unsigned char *ptr =
objt_server(conn->target)->ssl_ctx.reused_sess[tid].ptr
When tid equals 0 or 1, haproxy does not crash. It crashes when tid
equals 2.
So, the bug here is that hlua_init() is called too soon, or at least
that ssl_sock_prepare_srv_ctx() is, for the Lua SSL server socket.
I hope this bug is related to the other ones.
I am going to open another thread to post a patch.
Regards.