Re: dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier
Hi Pieter,

On Tue, Jun 12, 2018 at 08:09:08PM +0200, PiBa-NL wrote:
> Is there something i can do to find out more info if it happens again? Or
> maybe before that build with more specific debug info so if it happens again
> more info would be retrievable?
> (My usual way of 'debugging' the relatively easy to reproduce issues is
> just cramming a lot of printf statements into the code until something makes
> sense, but with an issue that only happens once in a blue moon (so far), that
> doesn't work very well...)

Building with -DDEBUG_THREAD and -DDEBUG_MEMORY can help to get a more
exploitable core file which will reveal where a given lock was already
taken. Be careful that it will slightly increase the CPU usage though.

> For the moment it hasn't happened again. I suspend/resume the vm it happened
> on almost daily (it's running on my workstation, which is shut down
> overnight); the vm also has haproxy running on it and some other stuff..
>
> If it does happen again, would there be any other gdb information helpful?
> Inspecting specific variables or creating a memory dump or something?
> (Please give a hint about the command/option to call if applicable/possible,
> i'm still a rookie with gdb..) P.S. i'm on FreeBSD, not sure if that matters
> for some of gdb's available options..

Creating a dump is often quite useful. You can do it with the command
"generate-core-file". A "bt full" and "info threads" will also be needed
as you did last time. I cannot guide you through the commands used to
debug the threads however because I don't know them; I think I remember
you need to print the contents of the lock itself (e.g. "p foo->lock").

> Would the complete configuration be helpful? There is a lot of useless
> stuff in there, because it's my test/development vm, and because it hasn't
> happened again there is no good opportunity to shrink it down to
> specific options/parts..
Well, probably not, though keeping it unmodified together with the
unstripped executable and the upcoming core may be useful afterwards.

Thanks!
Willy
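[For readers following the -DDEBUG_THREAD suggestion above: the general idea behind such instrumented locks is to wrap each lock with a record of where it was last taken, so that a core file inspected with "p foo->lock" points at the previous owner. The sketch below is only an illustration of that pattern, not HAProxy's actual implementation; the struct and function names (dbg_lock, dbg_lock_take) are made up for this example.]

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical debug lock: remembers the file/line of the last taker,
 * so "p lock.last_file" in gdb reveals the code path holding it. */
struct dbg_lock {
    pthread_mutex_t mtx;
    const char *last_file;   /* where the lock was last acquired */
    int last_line;
};

/* Callers would normally use a macro passing __FILE__/__LINE__. */
#define DBG_LOCK(l) dbg_lock_take((l), __FILE__, __LINE__)

void dbg_lock_take(struct dbg_lock *l, const char *file, int line)
{
    pthread_mutex_lock(&l->mtx);
    l->last_file = file;     /* recorded while the lock is held */
    l->last_line = line;
}

void dbg_lock_drop(struct dbg_lock *l)
{
    pthread_mutex_unlock(&l->mtx);
}
```

[The cost of this scheme is the extra stores on every acquisition, which is why such debug builds slightly increase CPU usage, as noted above.]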
Re: dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier
Hi Willy,

On 12-6-2018 at 14:31, Willy Tarreau wrote:
> This one is not known yet, to the best of my knowledge, or at least not
> reported yet.

Okay :) I guess I'll keep an eye on whether it happens again.

Is there something I can do to find out more info if it happens again? Or maybe, before that, build with more specific debug info so that if it happens again more info would be retrievable?
(My usual way of 'debugging' the relatively easy to reproduce issues is just cramming a lot of printf statements into the code until something makes sense, but with an issue that only happens once in a blue moon (so far), that doesn't work very well...)

For the moment it hasn't happened again. I suspend/resume the vm it happened on almost daily (it's running on my workstation, which is shut down overnight); the vm also has haproxy running on it and some other stuff..

If it does happen again, would there be any other gdb information helpful? Inspecting specific variables or creating a memory dump or something?
(Please give a hint about the command/option to call if applicable/possible, I'm still a rookie with gdb..) P.S. I'm on FreeBSD, not sure if that matters for some of gdb's available options..

Would the complete configuration be helpful? There is a lot of useless stuff in there, because it's my test/development vm, and because it hasn't happened again there is no good opportunity to shrink it down to specific options/parts..

Thanks for your analysis; though I'm not sure what to do with the info at the moment, I do hope you guys together can find a likely culprit from the information given ;).

Regards,
PiBa-NL (Pieter)
Re: dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier
Hi Pieter,

On Mon, Jun 11, 2018 at 10:48:25PM +0200, PiBa-NL wrote:
> Hi List,
>
> I've got no clue how i got into this state ;) and maybe there is nothing
> wrong.. (well i did resume a VM that was suspended for half a day..)
>
> Still thought it might be worth reporting, or perhaps it's solved already as
> there are a few fixes for threads after the 6-6 snapshot that i built with..
> Sometimes all that some people need is half an idea to find a problem... So
> maybe there is something that needs fixing??

This one is not known yet, to the best of my knowledge, or at least not
reported yet.

> (gdb) info threads
>   Id   Target Id                     Frame
> * 1    LWP 100660 of process 56253  0x005b0202 in thread_sync_barrier
>        (barrier=0x8bc690) at src/hathreads.c:109
>   2    LWP 101036 of process 56253  0x0050874a in process_chk_conn
>        (t=0x8025187c0, context=0x802482610, state=33) at src/checks.c:2112
>   3    LWP 101037 of process 56253  0x0050b58e in enqueue_one_email_alert
>        (p=0x80253f400, s=0x8024dec00, q=0x802482600,
>        msg=0x7fffdfdfc770 "Health check for server Test-SNI_ipvANY/srv451-4
>        failed, reason: Layer4 connection problem, info: \"General socket
>        error (Network is unreachable)\", check duration: 0ms,
>        status: 0/2 DOWN") at src/checks.c:3396

This is quite odd. Either some code quits a function without releasing the
server's lock or the email_queue's lock (suspicious but possible), or both
just happen to be the same lock. And at first glance, seeing that
process_chk_conn() is called with a context of 0x802482610 which is only
16 bytes above enqueue_one_email_alert's queue parameter, this last
possibility suddenly seems quite likely.

So my guess is that we take the server's lock in process_chk_conn(), then
go down through this:

  chk_report_conn_err()
    -> set_server_check_status()
      -> send_email_alert()
        -> enqueue_email_alert()
          -> enqueue_one_email_alert()

And here we spin on a lock which very likely is the same, even though I
have no idea why.
I suspect the way the queue or lock is retrieved there is incorrect and
explains the situation with a recursive lock, but I'm afraid I don't
really know since I don't know this part of the code :-(

Note, it could also be that the queue's lock has never been initialized,
which makes the inner lock block there with the server's lock held,
causing the second thread to spin and the third one to wait for the
barrier. I'll have to ask for some help on this part. I'm pretty
confident that 1.8 is affected as well.

Thanks,
Willy
dev1.9 2018/06/05 threads cpu 100% spin_lock v.s. thread_sync_barrier
Hi List,

I've got no clue how I got into this state ;) and maybe there is nothing wrong.. (well, I did resume a VM that was suspended for half a day..)

Still, I thought it might be worth reporting, or perhaps it's solved already, as there are a few fixes for threads after the 6-6 snapshot that I built with.. Sometimes all that some people need is half an idea to find a problem... So maybe there is something that needs fixing??

Haproxy is running with 3 threads at 300% cpu usage, .. some lua, almost no traffic, in a vm that just resumed operation and is still going through its passes to initialize its NICs, and some stuff noticed the clock jumped on its back and its dhcp lease expired or something like that.. anyhow, lots of things were going on at that moment..

Below are some of the details I've got about the threads: one is spinning, the others seemingly waiting for spin_locks?

Like I wrote, I'm not sure if it's 'something', and I don't know yet if I can reproduce it a second time. If more info is needed, please let me know and I'll try to provide it. But at the moment it's a one-time occurrence, I think.. If there is nothing obviously wrong, for now maybe just ignore this mail. Also, I'll update to the latest snapshot 2018/06/08. Maybe I won't see it ever again :).

haproxy -vv
HA-Proxy version 1.9-dev0-cc0a957 2018/06/05
Copyright 2000-2017 Willy Tarreau

Build options :
  TARGET  = freebsd
  CPU     = generic
  CC      = cc
  CFLAGS  = -pipe -g -fstack-protector -fno-strict-aliasing -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-address-of-packed-member -Wno-null-dereference -Wno-unused-label -DFREEBSD_PORTS
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_CPU_AFFINITY=1 USE_ACCEPT4=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_STATIC_PCRE=1 USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.40 2017-01-11
Running on PCRE version : 8.40 2017-01-11
PCRE library supports JIT : yes
Built with multi-threading support.
Encrypted password support via crypt(3): yes
Built with transparent proxy support using: IP_BINDANY IPV6_BINDANY
Built with Lua version : Lua 5.3.4
Built with OpenSSL version : OpenSSL 1.0.2k-freebsd 26 Jan 2017
Running on OpenSSL version : OpenSSL 1.0.2o-freebsd 27 Mar 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : SSLv3 TLSv1.0 TLSv1.1 TLSv1.2
Available polling systems :
     kqueue : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use kqueue.
Available filters :
	[TRACE] trace
	[COMP] compression
	[SPOE] spoe

(gdb) info threads
  Id   Target Id                     Frame
* 1    LWP 100660 of process 56253  0x005b0202 in thread_sync_barrier (barrier=0x8bc690) at src/hathreads.c:109
  2    LWP 101036 of process 56253  0x0050874a in process_chk_conn (t=0x8025187c0, context=0x802482610, state=33) at src/checks.c:2112
  3    LWP 101037 of process 56253  0x0050b58e in enqueue_one_email_alert (p=0x80253f400, s=0x8024dec00, q=0x802482600, msg=0x7fffdfdfc770 "Health check for server Test-SNI_ipvANY/srv451-4 failed, reason: Layer4 connection problem, info: \"General socket error (Network is unreachable)\", check duration: 0ms, status: 0/2 DOWN") at src/checks.c:3396
(gdb) next
110	in src/hathreads.c
(gdb) next
109	in src/hathreads.c
(gdb) next
110	in src/hathreads.c
(gdb) next
109	in src/hathreads.c
(gdb) next
110	in src/hathreads.c
(gdb) next
109	in src/hathreads.c
(gdb) next
Command name abbreviations are allowed if unambiguous.
(gdb) thread 1
[Switching to thread 1 (LWP 100660 of process 56253)]
#0  thread_sync_barrier (barrier=0x8bc690) at src/hathreads.c:109
109	src/hathreads.c: No such file or directory.
(gdb) bt full
#0  thread_sync_barrier (barrier=0x8bc690) at src/hathreads.c:109
        old = 7
#1  0x005b0038 in thread_enter_sync () at src/hathreads.c:122
        barrier = 1
#2  0x0051737c in sync_poll_loop () at src/haproxy.c:2380
No locals.
#3  0x005172ed in run_poll_loop () at src/haproxy.c:2432
        next = -357534104
        exp = -357534104
#4  0x00514670 in run_thread_poll_loop (data=0x802491380) at src/haproxy.c:2462
        start_lock = 0
        ptif = 0x8af8f8
        ptdf = 0x7fffec80
#5  0x00511199 in main (argc=10, argv=0x7fffec28) at src/haproxy.c:3052
        tids = 0x802491380
        threads = 0x8024831a0
        i = 3
        err = 0
        retry = 200
        limit = {rlim_cur = 6116, rlim_max = 6116}
        errmsg = "\000\354\377\377\377\177\000\000\2
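[For context on why thread 1 burns CPU too: a sync barrier of the kind thread_sync_barrier() implements is typically a spin barrier, where each thread ORs its bit into a shared mask and busy-waits until all bits are present. If any thread is blocked elsewhere (here, on the suspected lock), the waiters spin forever, exactly the back-and-forth between hathreads.c lines 109/110 seen in the gdb session above. A minimal sketch of that pattern, using C11 atomics rather than HAProxy's own macros, and with made-up names:]

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NTHREADS 3

static atomic_ulong barrier_mask;                       /* bit per thread */
static const unsigned long all_mask = (1UL << NTHREADS) - 1;

/* Each thread publishes its bit, then spins until every thread arrived.
 * If one thread never gets here (e.g. stuck on a lock), the others
 * busy-wait at 100% CPU -- the state captured in the backtrace above. */
static void *sync_point(void *arg)
{
    unsigned long tid_bit = 1UL << (unsigned long)(uintptr_t)arg;

    atomic_fetch_or(&barrier_mask, tid_bit);
    while ((atomic_load(&barrier_mask) & all_mask) != all_mask)
        ;                       /* spin, like the hathreads.c:109 loop */
    return NULL;
}

/* Runs NTHREADS threads through the barrier; returns 1 if all arrived. */
int run_barrier_demo(void)
{
    pthread_t th[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, sync_point, (void *)(uintptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);

    return (atomic_load(&barrier_mask) & all_mask) == all_mask;
}
```

[This also explains the "old = 7" local in the backtrace: with 3 threads, a fully-populated mask is 0x7, so thread 1 had seen everyone arrive at some point and was presumably spinning on a later barrier state.]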