And I have a debug trace, below :) Any ideas on how to deal with this bug yet?
> > On Tue, 28 Jan 2014, dormando wrote: > > > > > > On Mon, Jan 13, 2014 at 9:22 AM, dormando <[email protected]> wrote: > > > > > > > > > > > There's no forwarder from the central server, if I understand what > > > > > > you > > > > > > were saying. Also given the strace of the select (which is > > > > > > selecting for > > > > > > nothing and waiting forever?) is it checking for something > > > > > > inbetween the > > > > > > select runs? > > > > > > > > > > > > > > > > > OK, the configs I think look good. What exactly it is waiting for is > > > > > hard > > > > > to tell without context. Can you gather a debug log while it is in > > > > > this > > > > > state. Details are here: > > > > > > > > > > http://www.rsyslog.com/doc/troubleshoot.html > > > > > > > > > > If a full debug log is too much to do, you may want to use debug on > > > > > demand: > > > > > > > > > > http://www.rsyslog.com/how-to-use-debug-on-demand/ > > > > > > > > > > If that's also impossible, at least many more strace lines PLUS > > > > > information > > > > > on what the thread name is (obtained via top) may also help us get > > > > > closer. > > > > > > > > on-demand might work. It'll take a few days to iterate through this due > > > > to > > > > the infrequency of the hang. Thought it was hung just now but it started > > > > working again... I guess it could also be intermittent. Maybe I'll set > > > > up > > > > a script to watch for log gaps and snapshot the debug stuff. > > > > > > > > Thanks for your fast responses! > > > > > > > > > > Well, it took a week to fail that time :/ > > > > > > strace was showing a select retry that was going to take over a day, so I > > > didn't wait for it to catch the lines inbetween :/ I might have to patch > > > the code to set an upper limit to whatever backoff that is? Or I could've > > > tried to interrupt it... > > > > > > im:imtcp - just one thread, sitting in 100% CPU on one core. > > > > > > Local logging continues to work... I see log lines show up for stuff I do > > > *on* the log box. no remote log lines appear to be incoming. > > > > > > I took some gdb backtrace snapshots since that's the best I could do this > > > time. Will hopefully get more strace and the debug output next time... if > > > the thread will wake up and start applying it. > > > > > > First below are three summary backtraces which show some threads changing > > > state, and the rest not doing much. Below that is one full all-thread > > > stack trace. > > > > > > These are taken from the hung process. > > > > > > summary of backtrace: > > > 4 pthread_cond_wait@@GLIBC_2.3.2,wrkr,start_thread,clone,?? > > > 1 select,runInput,thrdStarter,start_thread,clone,?? > > > 1 select,mainloop,realMain,main > > > 1 > > > read,readklog,klogLogKMsg,runInput,thrdStarter,start_thread,clone,?? > > > 1 > > > pthread_cond_wait@@GLIBC_2.3.2,doIdleProcessing,wtiWorker,wtpWorker,start_thread,clone,?? > > > 1 > > > IsReady,IsReady,RunSelect,Run,runInput,thrdStarter,start_thread,clone,?? > > > > > > another snapshot: > > > 4 pthread_cond_wait@@GLIBC_2.3.2,wrkr,start_thread,clone,?? > > > 1 > > > TCPSessGetNxtSess,RunSelect,Run,runInput,thrdStarter,start_thread,clone,?? > > > 1 select,runInput,thrdStarter,start_thread,clone,?? > > > 1 select,mainloop,realMain,main > > > 1 > > > read,readklog,klogLogKMsg,runInput,thrdStarter,start_thread,clone,?? > > > 1 > > > pthread_cond_wait@@GLIBC_2.3.2,doIdleProcessing,wtiWorker,wtpWorker,start_thread,clone,?? > > > > > > another snapshot: > > > 4 pthread_cond_wait@@GLIBC_2.3.2,wrkr,start_thread,clone,?? > > > 1 select,runInput,thrdStarter,start_thread,clone,?? > > > 1 select,mainloop,realMain,main > > > 1 > > > read,readklog,klogLogKMsg,runInput,thrdStarter,start_thread,clone,?? > > > 1 > > > pthread_cond_wait@@GLIBC_2.3.2,doIdleProcessing,wtiWorker,wtpWorker,start_thread,clone,?? > > > 1 > > > pNsd=0x7f80142e8990,,Add,Add,RunSelect,Run,runInput,thrdStarter,start_thread,clone,?? > > > > > > > > > Full stack trace: > > > [Thread debugging using libthread_db enabled] > > > Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". > > > [New Thread 0x7f80137fd700 (LWP 31775)] > > > [New Thread 0x7f8013bfe700 (LWP 31774)] > > > [New Thread 0x7f8013fff700 (LWP 31773)] > > > [New Thread 0x7f8018e81700 (LWP 31772)] > > > [New Thread 0x7f8018a80700 (LWP 31771)] > > > [New Thread 0x7f8019682700 (LWP 31770)] > > > [New Thread 0x7f8019e83700 (LWP 31769)] > > > [New Thread 0x7f801a684700 (LWP 31768)] > > > 0x00007f801d359763 in select () from /lib/x86_64-linux-gnu/libc.so.6 > > > > > > Thread 9 (Thread 0x7f801a684700 (LWP 31768)): > > > #0 0x00007f801d359763 in select () from /lib/x86_64-linux-gnu/libc.so.6 > > > #1 0x00007f801cbf44cd in runInput (pThrd=0x2550060) at imuxsock.c:1363 > > > #2 0x00000000004638b6 in thrdStarter (arg=0x2550060) at ../threads.c:212 > > > #3 0x00007f801e050e9a in start_thread () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #4 0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > > > #5 0x0000000000000000 in ?? () > > > > > > Thread 8 (Thread 0x7f8019e83700 (LWP 31769)): > > > #0 0x00007f801e057d2d in read () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #1 0x00007f801c9dd001 in readklog (pModConf=0x252f940) at bsd.c:228 > > > #2 0x00007f801c9dd27e in klogLogKMsg (pModConf=0x252f940) at bsd.c:286 > > > #3 0x00007f801c9db926 in runInput (pThrd=0x2559cd0) at imklog.c:282 > > > #4 0x00000000004638b6 in thrdStarter (arg=0x2559cd0) at ../threads.c:212 > > > #5 0x00007f801e050e9a in start_thread () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #6 0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > > > #7 0x0000000000000000 in ?? () > > > > > > Thread 7 (Thread 0x7f8019682700 (LWP 31770)): > > > #0 0x00007f801ae2a2a0 in Add (pNsdsel=0x7f80141e7d80, > > > pNsd=0x7f80142e8990, > > > waitOp=NSDSEL_RD) at nsdsel_ptcp.c:82 > > > #1 0x00007f801bb94f46 in Add (pNsdsel=0x7f801417c020, > > > pNsd=0x7f80140e4420, > > > waitOp=NSDSEL_RD) at nsdsel_gtls.c:93 > > > #2 0x00007f801bfa3536 in Add (pThis=0x7f80144b32a0, pStrm=0x7f80142a8300, > > > waitOp=NSDSEL_RD) at nssel.c:142 > > > #3 0x00007f801bd9bea5 in RunSelect (pThis=0x252b5e0, > > > workset=0x7f8019681570, > > > sizeWorkset=128) at tcpsrv.c:758 > > > #4 0x00007f801bd9c29f in Run (pThis=0x252b5e0) at tcpsrv.c:868 > > > #5 0x00007f801c1a8351 in runInput (pThrd=0x2559e90) at imtcp.c:584 > > > #6 0x00000000004638b6 in thrdStarter (arg=0x2559e90) at ../threads.c:212 > > > #7 0x00007f801e050e9a in start_thread () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #8 0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > > > #9 0x0000000000000000 in ?? () > > > > > > Thread 6 (Thread 0x7f8018a80700 (LWP 31771)): > > > #0 0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #1 0x0000000000450dab in doIdleProcessing (pThis=0x25555d0, > > > pWtp=0x255f9f0, > > > pbInactivityTOOccured=0x7f8018a7fd18) at wti.c:252 > > > #2 0x0000000000451082 in wtiWorker (pThis=0x25555d0) at wti.c:325 > > > #3 0x0000000000450205 in wtpWorker (arg=0x25555d0) at wtp.c:388 > > > #4 0x00007f801e050e9a in start_thread () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #5 0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > > > #6 0x0000000000000000 in ?? () > > > > > > Thread 5 (Thread 0x7f8018e81700 (LWP 31772)): > > > #0 0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #1 0x00007f801bd9b97c in wrkr (myself=0x7f801bfa0bc0) at tcpsrv.c:631 > > > #2 0x00007f801e050e9a in start_thread () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #3 0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > > > #4 0x0000000000000000 in ?? () > > > > > > Thread 4 (Thread 0x7f8013fff700 (LWP 31773)): > > > #0 0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #1 0x00007f801bd9b97c in wrkr (myself=0x7f801bfa0c28) at tcpsrv.c:631 > > > #2 0x00007f801e050e9a in start_thread () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #3 0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > > > #4 0x0000000000000000 in ?? () > > > > > > Thread 3 (Thread 0x7f8013bfe700 (LWP 31774)): > > > #0 0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #1 0x00007f801bd9b97c in wrkr (myself=0x7f801bfa0c90) at tcpsrv.c:631 > > > #2 0x00007f801e050e9a in start_thread () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #3 0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > > > #4 0x0000000000000000 in ?? () > > > > > > Thread 2 (Thread 0x7f80137fd700 (LWP 31775)): > > > #0 0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #1 0x00007f801bd9b97c in wrkr (myself=0x7f801bfa0cf8) at tcpsrv.c:631 > > > #2 0x00007f801e050e9a in start_thread () from > > > /lib/x86_64-linux-gnu/libpthread.so.0 > > > #3 0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > > > #4 0x0000000000000000 in ?? () > > > > > > Thread 1 (Thread 0x7f801e691740 (LWP 31767)): > > > #0 0x00007f801d359763 in select () from /lib/x86_64-linux-gnu/libc.so.6 > > > #1 0x000000000040f22c in mainloop () at syslogd.c:1286 > > > #2 0x00000000004108da in realMain (argc=4, argv=0x7fffad7a59b8) at > > > syslogd.c:2038 > > > #3 0x00000000004109ba in main (argc=4, argv=0x7fffad7a59b8) at > > > syslogd.c:2068 > > > > > > > I didn't fatfinger the strace this time: the im:imtcp thread is still > > hung. All other threads seem to be doing stuff via the strace. The "hung" > > thread is spewing this: > > 10730 recvfrom(58, 0x7f49f41835e0, 461, 0, 0, 0) = -1 EAGAIN (Resource > > temporarily unavailable) > > 10730 recvfrom(58, 0x7f49f41835e0, 461, 0, 0, 0) = -1 EAGAIN (Resource > > temporarily unavailable) > > 10730 recvfrom(58, 0x7f49f41835e0, 461, 0, 0, 0) = -1 EAGAIN (Resource > > temporarily unavailable) > > 10730 recvfrom(58, 0x7f49f41835e0, 461, 0, 0, 0) = -1 EAGAIN (Resource > > temporarily unavailable) > > > > ... about as fast as it can execute them. > > > > It's always the same fd (58), which is just one random tcp connection from > > some random client. > > ... hit send a bit too early, since I am a failure... A couple backtrace > snapshots from the thread in question: > > Thread 7 (Thread 0x7f49fb25e700 (LWP 10730)): > #0 0x00007f49fdb7f522 in Add (pThis=0x7f49f4285be0, pStrm=0x7f49f4027c60, > waitOp=NSDSEL_RD) at nssel.c:142 > #1 0x00007f49fd977ea5 in RunSelect (pThis=0x156f5e0, > workset=0x7f49fb25d570, sizeWorkset=128) at tcpsrv.c:758 > #2 0x00007f49fd97829f in Run (pThis=0x156f5e0) at tcpsrv.c:868 > #3 0x00007f49fdd84351 in runInput (pThrd=0x159de90) at imtcp.c:584 > #4 0x00000000004638b6 in thrdStarter (arg=0x159de90) at ../threads.c:212 > #5 0x00007f49ffc2ce9a in start_thread () from > /lib/x86_64-linux-gnu/libpthread.so.0 > #6 0x00007f49fef3c3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > #7 0x0000000000000000 in ?? () > > > Thread 7 (Thread 0x7f49fb25e700 (LWP 10730)): > #0 0x00007f49fd4e98c6 in gnutls_record_get_direction () from > /usr/lib/x86_64-linux-gnu/libgnutls.so.26 > #1 0x00007f49fd770ec3 in Add (pNsdsel=0x7f49f42ec1e0, > pNsd=0x7f49f40e7700, waitOp=NSDSEL_RD) at nsdsel_gtls.c:83 > #2 0x00007f49fdb7f536 in Add (pThis=0x7f49f4285be0, pStrm=0x7f49f40f46d0, > waitOp=NSDSEL_RD) at nssel.c:142 > #3 0x00007f49fd977ea5 in RunSelect (pThis=0x156f5e0, > workset=0x7f49fb25d570, sizeWorkset=128) at tcpsrv.c:758 > #4 0x00007f49fd97829f in Run (pThis=0x156f5e0) at tcpsrv.c:868 > #5 0x00007f49fdd84351 in runInput (pThrd=0x159de90) at imtcp.c:584 > #6 0x00000000004638b6 in thrdStarter (arg=0x159de90) at ../threads.c:212 > #7 0x00007f49ffc2ce9a in start_thread () from > /lib/x86_64-linux-gnu/libpthread.so.0 > #8 0x00007f49fef3c3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > #9 0x0000000000000000 in ?? () > > Thread 7 (Thread 0x7f49fb25e700 (LWP 10730)): > #0 0x00007f49fd976ca6 in TCPSessGetNxtSess (pThis=0x156f5e0, iCurr=161) > at tcpsrv.c:257 > #1 0x00007f49fd977ec6 in RunSelect (pThis=0x156f5e0, > workset=0x7f49fb25d570, sizeWorkset=128) at tcpsrv.c:760 > #2 0x00007f49fd97829f in Run (pThis=0x156f5e0) at tcpsrv.c:868 > #3 0x00007f49fdd84351 in runInput (pThrd=0x159de90) at imtcp.c:584 > #4 0x00000000004638b6 in thrdStarter (arg=0x159de90) at ../threads.c:212 > #5 0x00007f49ffc2ce9a in start_thread () from > /lib/x86_64-linux-gnu/libpthread.so.0 > #6 0x00007f49fef3c3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > #7 0x0000000000000000 in ?? () > > Thread 7 (Thread 0x7f49fb25e700 (LWP 10730)): > #0 gtlsHasRcvInBuffer (pThis=0x7f49f403a4b0) at nsd_gtls.h:84 > #1 0x00007f49fd770e6f in Add (pNsdsel=0x7f49f42ec1e0, > pNsd=0x7f49f403a4b0, waitOp=NSDSEL_RD) at nsdsel_gtls.c:75 > #2 0x00007f49fdb7f536 in Add (pThis=0x7f49f4285be0, pStrm=0x7f49f403a1c0, > waitOp=NSDSEL_RD) at nssel.c:142 > #3 0x00007f49fd977ea5 in RunSelect (pThis=0x156f5e0, > workset=0x7f49fb25d570, sizeWorkset=128) at tcpsrv.c:758 > #4 0x00007f49fd97829f in Run (pThis=0x156f5e0) at tcpsrv.c:868 > #5 0x00007f49fdd84351 in runInput (pThrd=0x159de90) at imtcp.c:584 > #6 0x00000000004638b6 in thrdStarter (arg=0x159de90) at ../threads.c:212 > #7 0x00007f49ffc2ce9a in start_thread () from > /lib/x86_64-linux-gnu/libpthread.so.0 > #8 0x00007f49fef3c3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6 > #9 0x0000000000000000 in ?? () > > Thanks! Still no debug trace, but this should be pretty enlightening at > this point? 6291.841035907:7f57465c1700: GnuTLS requested retry of 2 operation - executing 6291.841038470:7f57465c1700: retrying gtls recv, nsd: 0x7f5740268240 6291.841043219:7f57465c1700: GnuTLS receive requires a retry (this most probably is OK and no error condition) 6291.841046425:7f57465c1700: gtlsRecordRecv return. nsd 0x7f5740268240, iRet -2100, lenRcvd -28, lenRcvBuf -1, ptrRcvBuf 2269 6291.841049194:7f57465c1700: XXXXXX: doRetry: iRet -2100, pNsd->bAbortConn 0 6291.841051917:7f57465c1700: tcpsrv: ready to process 1 event entries 6291.841054615:7f57465c1700: tcpsrv: processing item 141, pUsr 0x7f574024b730, bAbortConn 6291.841057347:7f57465c1700: netstream 0x7f574024b660 with new data 6291.841061298:7f57465c1700: GnuTLS receive requires a retry (this most probably is OK and no error condition) 6291.841064556:7f57465c1700: gtlsRecordRecv return. nsd 0x7f5740268240, iRet -2100, lenRcvd -28, lenRcvBuf -1, ptrRcvBuf 2269 6291.841067560:7f57465c1700: gtlsRcv return. nsd 0x7f5740268240, iRet -2100, lenRcvBuf -1, ptrRcvBuf 2269 6291.841072960:7f57465c1700: hasRcvInBuffer on nsd 0x10b7240: pszRcvBuf (nil), lenRcvBuf 0 6291.841076130:7f57465c1700: hasRcvInBuffer on nsd 0x10c1040: pszRcvBuf (nil), lenRcvBuf 0 6291.841079174:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740000ff0: pszRcvBuf 0x7f573c0008c0, lenRcvBuf -1 6291.841082109:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740008af0: pszRcvBuf 0x7f57280008c0, lenRcvBuf -1 6291.841084987:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740013c70: pszRcvBuf 0x7f57300008c0, lenRcvBuf -1 6291.841087731:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740014cb0: pszRcvBuf 0x7f573c007490, lenRcvBuf -1 [....snip, repeats a large number of times lines with changing nsd/pszRcvBuf...] 6291.841561956:7f57465c1700: nsdsel_gtls: doing dummy select, data present 6291.841564871:7f57465c1700: hasRcvInBuffer on nsd 0x10b7240: pszRcvBuf (nil), lenRcvBuf 0 6291.841567257:7f57465c1700: nsd_gtls: dummy read, buffer not available for this FD 6291.841570053:7f57465c1700: hasRcvInBuffer on nsd 0x10c1040: pszRcvBuf (nil), lenRcvBuf 0 6291.841572489:7f57465c1700: nsd_gtls: dummy read, buffer not available for this FD 6291.841578650:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740000ff0: pszRcvBuf 0x7f573c0008c0, lenRcvBuf -1 6291.841581387:7f57465c1700: nsd_gtls: dummy read, buffer not available for this FD 6291.841584133:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740008af0: pszRcvBuf 0x7f57280008c0, lenRcvBuf -1 6291.841586528:7f57465c1700: nsd_gtls: dummy read, buffer not available for this FD 6291.841589417:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740013c70: pszRcvBuf 0x7f57300008c0, lenRcvBuf -1 [... last two lines repeat with different nsd,pszRcvBuf values for a while...] It made about a million lines of this in a few seconds. I don't see anything else in the debug log. This time it took ~6 days to crash again. What else can I do to help? thanks! _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com/professional-services/ What's up with rsyslog? Follow https://twitter.com/rgerhards NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.

