And I have a debug trace, below :) Any ideas on how to deal with this bug
yet?

>
> On Tue, 28 Jan 2014, dormando wrote:
>
> > > > > On Mon, Jan 13, 2014 at 9:22 AM, dormando <[email protected]> wrote:
> > > > >
> > > > > > There's no forwarder from the central server, if I understand what 
> > > > > > you
> > > > > > were saying. Also given the strace of the select (which is 
> > > > > > selecting for
> > > > > > nothing and waiting forever?) is it checking for something 
> > > > > > inbetween the
> > > > > > select runs?
> > > > > >
> > > > > >
> > > > > OK, the configs I think look good. What exactly it is waiting for is 
> > > > > hard
> > > > > to tell without context. Can you gather a debug log while it is in 
> > > > > this
> > > > > state. Details are here:
> > > > >
> > > > > http://www.rsyslog.com/doc/troubleshoot.html
> > > > >
> > > > > If a full debug log is too much to do, you may want to use debug on 
> > > > > demand:
> > > > >
> > > > > http://www.rsyslog.com/how-to-use-debug-on-demand/
> > > > >
> > > > > If that's also impossible, at least many more strace lines PLUS 
> > > > > information
> > > > > on what the thread name is (obtained via top) may also help us get 
> > > > > closer.
> > > >
> > > > on-demand might work. It'll take a few days to iterate through this due 
> > > > to
> > > > the infrequency of the hang. Thought it was hung just now but it started
> > > > working again... I guess it could also be intermittent. Maybe I'll set 
> > > > up
> > > > a script to watch for log gaps and snapshot the debug stuff.
> > > >
> > > > Thanks for your fast responses!
> > > >
> > >
> > > Well, it took a week to fail that time :/
> > >
> > > strace was showing a select retry that was going to take over a day, so I
> > > didn't wait for it to catch the lines inbetween :/ I might have to patch
> > > the code to set an upper limit to whatever backoff that is? Or I could've
> > > tried to interrupt it...
> > >
> > > im:imtcp - just one thread, sitting in 100% CPU on one core.
> > >
> > > Local logging continues to work... I see log lines show up for stuff I do
> > > *on* the log box. no remote log lines appear to be incoming.
> > >
> > > I took some gdb backtrace snapshots since that's the best I could do this
> > > time. Will hopefully get more strace and the debug output next time... if
> > > the thread will wake up and start applying it.
> > >
> > > First below are three summary backtraces which show some threads changing
> > > state, and the rest not doing much. Below that is one full all-thread
> > > stack trace.
> > >
> > > These are taken from the hung process.
> > >
> > > summary of backtrace:
> > >       4 pthread_cond_wait@@GLIBC_2.3.2,wrkr,start_thread,clone,??
> > >       1 select,runInput,thrdStarter,start_thread,clone,??
> > >       1 select,mainloop,realMain,main
> > >       1 
> > > read,readklog,klogLogKMsg,runInput,thrdStarter,start_thread,clone,??
> > >       1
> > > pthread_cond_wait@@GLIBC_2.3.2,doIdleProcessing,wtiWorker,wtpWorker,start_thread,clone,??
> > >       1
> > > IsReady,IsReady,RunSelect,Run,runInput,thrdStarter,start_thread,clone,??
> > >
> > > another snapshot:
> > >       4 pthread_cond_wait@@GLIBC_2.3.2,wrkr,start_thread,clone,??
> > >       1
> > > TCPSessGetNxtSess,RunSelect,Run,runInput,thrdStarter,start_thread,clone,??
> > >       1 select,runInput,thrdStarter,start_thread,clone,??
> > >       1 select,mainloop,realMain,main
> > >       1 
> > > read,readklog,klogLogKMsg,runInput,thrdStarter,start_thread,clone,??
> > >       1
> > > pthread_cond_wait@@GLIBC_2.3.2,doIdleProcessing,wtiWorker,wtpWorker,start_thread,clone,??
> > >
> > > another snapshot:
> > >       4 pthread_cond_wait@@GLIBC_2.3.2,wrkr,start_thread,clone,??
> > >       1 select,runInput,thrdStarter,start_thread,clone,??
> > >       1 select,mainloop,realMain,main
> > >       1 
> > > read,readklog,klogLogKMsg,runInput,thrdStarter,start_thread,clone,??
> > >       1
> > > pthread_cond_wait@@GLIBC_2.3.2,doIdleProcessing,wtiWorker,wtpWorker,start_thread,clone,??
> > >       1
> > > pNsd=0x7f80142e8990,,Add,Add,RunSelect,Run,runInput,thrdStarter,start_thread,clone,??
> > >
> > >
> > > Full stack trace:
> > > [Thread debugging using libthread_db enabled]
> > > Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> > > [New Thread 0x7f80137fd700 (LWP 31775)]
> > > [New Thread 0x7f8013bfe700 (LWP 31774)]
> > > [New Thread 0x7f8013fff700 (LWP 31773)]
> > > [New Thread 0x7f8018e81700 (LWP 31772)]
> > > [New Thread 0x7f8018a80700 (LWP 31771)]
> > > [New Thread 0x7f8019682700 (LWP 31770)]
> > > [New Thread 0x7f8019e83700 (LWP 31769)]
> > > [New Thread 0x7f801a684700 (LWP 31768)]
> > > 0x00007f801d359763 in select () from /lib/x86_64-linux-gnu/libc.so.6
> > >
> > > Thread 9 (Thread 0x7f801a684700 (LWP 31768)):
> > > #0  0x00007f801d359763 in select () from /lib/x86_64-linux-gnu/libc.so.6
> > > #1  0x00007f801cbf44cd in runInput (pThrd=0x2550060) at imuxsock.c:1363
> > > #2  0x00000000004638b6 in thrdStarter (arg=0x2550060) at ../threads.c:212
> > > #3  0x00007f801e050e9a in start_thread () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #4  0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #5  0x0000000000000000 in ?? ()
> > >
> > > Thread 8 (Thread 0x7f8019e83700 (LWP 31769)):
> > > #0  0x00007f801e057d2d in read () from 
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #1  0x00007f801c9dd001 in readklog (pModConf=0x252f940) at bsd.c:228
> > > #2  0x00007f801c9dd27e in klogLogKMsg (pModConf=0x252f940) at bsd.c:286
> > > #3  0x00007f801c9db926 in runInput (pThrd=0x2559cd0) at imklog.c:282
> > > #4  0x00000000004638b6 in thrdStarter (arg=0x2559cd0) at ../threads.c:212
> > > #5  0x00007f801e050e9a in start_thread () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #6  0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #7  0x0000000000000000 in ?? ()
> > >
> > > Thread 7 (Thread 0x7f8019682700 (LWP 31770)):
> > > #0  0x00007f801ae2a2a0 in Add (pNsdsel=0x7f80141e7d80, 
> > > pNsd=0x7f80142e8990,
> > > waitOp=NSDSEL_RD) at nsdsel_ptcp.c:82
> > > #1  0x00007f801bb94f46 in Add (pNsdsel=0x7f801417c020, 
> > > pNsd=0x7f80140e4420,
> > > waitOp=NSDSEL_RD) at nsdsel_gtls.c:93
> > > #2  0x00007f801bfa3536 in Add (pThis=0x7f80144b32a0, pStrm=0x7f80142a8300,
> > > waitOp=NSDSEL_RD) at nssel.c:142
> > > #3  0x00007f801bd9bea5 in RunSelect (pThis=0x252b5e0, 
> > > workset=0x7f8019681570,
> > > sizeWorkset=128) at tcpsrv.c:758
> > > #4  0x00007f801bd9c29f in Run (pThis=0x252b5e0) at tcpsrv.c:868
> > > #5  0x00007f801c1a8351 in runInput (pThrd=0x2559e90) at imtcp.c:584
> > > #6  0x00000000004638b6 in thrdStarter (arg=0x2559e90) at ../threads.c:212
> > > #7  0x00007f801e050e9a in start_thread () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #8  0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #9  0x0000000000000000 in ?? ()
> > >
> > > Thread 6 (Thread 0x7f8018a80700 (LWP 31771)):
> > > #0  0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #1  0x0000000000450dab in doIdleProcessing (pThis=0x25555d0, 
> > > pWtp=0x255f9f0,
> > > pbInactivityTOOccured=0x7f8018a7fd18) at wti.c:252
> > > #2  0x0000000000451082 in wtiWorker (pThis=0x25555d0) at wti.c:325
> > > #3  0x0000000000450205 in wtpWorker (arg=0x25555d0) at wtp.c:388
> > > #4  0x00007f801e050e9a in start_thread () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #5  0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #6  0x0000000000000000 in ?? ()
> > >
> > > Thread 5 (Thread 0x7f8018e81700 (LWP 31772)):
> > > #0  0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #1  0x00007f801bd9b97c in wrkr (myself=0x7f801bfa0bc0) at tcpsrv.c:631
> > > #2  0x00007f801e050e9a in start_thread () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #3  0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #4  0x0000000000000000 in ?? ()
> > >
> > > Thread 4 (Thread 0x7f8013fff700 (LWP 31773)):
> > > #0  0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #1  0x00007f801bd9b97c in wrkr (myself=0x7f801bfa0c28) at tcpsrv.c:631
> > > #2  0x00007f801e050e9a in start_thread () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #3  0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #4  0x0000000000000000 in ?? ()
> > >
> > > Thread 3 (Thread 0x7f8013bfe700 (LWP 31774)):
> > > #0  0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #1  0x00007f801bd9b97c in wrkr (myself=0x7f801bfa0c90) at tcpsrv.c:631
> > > #2  0x00007f801e050e9a in start_thread () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #3  0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #4  0x0000000000000000 in ?? ()
> > >
> > > Thread 2 (Thread 0x7f80137fd700 (LWP 31775)):
> > > #0  0x00007f801e054d84 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #1  0x00007f801bd9b97c in wrkr (myself=0x7f801bfa0cf8) at tcpsrv.c:631
> > > #2  0x00007f801e050e9a in start_thread () from
> > > /lib/x86_64-linux-gnu/libpthread.so.0
> > > #3  0x00007f801d3603fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #4  0x0000000000000000 in ?? ()
> > >
> > > Thread 1 (Thread 0x7f801e691740 (LWP 31767)):
> > > #0  0x00007f801d359763 in select () from /lib/x86_64-linux-gnu/libc.so.6
> > > #1  0x000000000040f22c in mainloop () at syslogd.c:1286
> > > #2  0x00000000004108da in realMain (argc=4, argv=0x7fffad7a59b8) at
> > > syslogd.c:2038
> > > #3  0x00000000004109ba in main (argc=4, argv=0x7fffad7a59b8) at 
> > > syslogd.c:2068
> > >
> >
> > I didn't fatfinger the strace this time: the im:imtcp thread is still
> > hung. All other threads seem to be doing stuff via the strace. The "hung"
> > thread is spewing this:
> > 10730 recvfrom(58, 0x7f49f41835e0, 461, 0, 0, 0) = -1 EAGAIN (Resource
> > temporarily unavailable)
> > 10730 recvfrom(58, 0x7f49f41835e0, 461, 0, 0, 0) = -1 EAGAIN (Resource
> > temporarily unavailable)
> > 10730 recvfrom(58, 0x7f49f41835e0, 461, 0, 0, 0) = -1 EAGAIN (Resource
> > temporarily unavailable)
> > 10730 recvfrom(58, 0x7f49f41835e0, 461, 0, 0, 0) = -1 EAGAIN (Resource
> > temporarily unavailable)
> >
> > ... about as fast as it can execute them.
> >
> > It's always the same fd (58), which is just one random tcp connection from
> > some random client.
>
> ... hit send a bit too early, since I am a failure... A couple backtrace
> snapshots from the thread in question:
>
> Thread 7 (Thread 0x7f49fb25e700 (LWP 10730)):
> #0  0x00007f49fdb7f522 in Add (pThis=0x7f49f4285be0, pStrm=0x7f49f4027c60,
> waitOp=NSDSEL_RD) at nssel.c:142
> #1  0x00007f49fd977ea5 in RunSelect (pThis=0x156f5e0,
> workset=0x7f49fb25d570, sizeWorkset=128) at tcpsrv.c:758
> #2  0x00007f49fd97829f in Run (pThis=0x156f5e0) at tcpsrv.c:868
> #3  0x00007f49fdd84351 in runInput (pThrd=0x159de90) at imtcp.c:584
> #4  0x00000000004638b6 in thrdStarter (arg=0x159de90) at ../threads.c:212
> #5  0x00007f49ffc2ce9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #6  0x00007f49fef3c3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #7  0x0000000000000000 in ?? ()
>
>
> Thread 7 (Thread 0x7f49fb25e700 (LWP 10730)):
> #0  0x00007f49fd4e98c6 in gnutls_record_get_direction () from
> /usr/lib/x86_64-linux-gnu/libgnutls.so.26
> #1  0x00007f49fd770ec3 in Add (pNsdsel=0x7f49f42ec1e0,
> pNsd=0x7f49f40e7700, waitOp=NSDSEL_RD) at nsdsel_gtls.c:83
> #2  0x00007f49fdb7f536 in Add (pThis=0x7f49f4285be0, pStrm=0x7f49f40f46d0,
> waitOp=NSDSEL_RD) at nssel.c:142
> #3  0x00007f49fd977ea5 in RunSelect (pThis=0x156f5e0,
> workset=0x7f49fb25d570, sizeWorkset=128) at tcpsrv.c:758
> #4  0x00007f49fd97829f in Run (pThis=0x156f5e0) at tcpsrv.c:868
> #5  0x00007f49fdd84351 in runInput (pThrd=0x159de90) at imtcp.c:584
> #6  0x00000000004638b6 in thrdStarter (arg=0x159de90) at ../threads.c:212
> #7  0x00007f49ffc2ce9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #8  0x00007f49fef3c3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #9  0x0000000000000000 in ?? ()
>
> Thread 7 (Thread 0x7f49fb25e700 (LWP 10730)):
> #0  0x00007f49fd976ca6 in TCPSessGetNxtSess (pThis=0x156f5e0, iCurr=161)
> at tcpsrv.c:257
> #1  0x00007f49fd977ec6 in RunSelect (pThis=0x156f5e0,
> workset=0x7f49fb25d570, sizeWorkset=128) at tcpsrv.c:760
> #2  0x00007f49fd97829f in Run (pThis=0x156f5e0) at tcpsrv.c:868
> #3  0x00007f49fdd84351 in runInput (pThrd=0x159de90) at imtcp.c:584
> #4  0x00000000004638b6 in thrdStarter (arg=0x159de90) at ../threads.c:212
> #5  0x00007f49ffc2ce9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #6  0x00007f49fef3c3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #7  0x0000000000000000 in ?? ()
>
> Thread 7 (Thread 0x7f49fb25e700 (LWP 10730)):
> #0  gtlsHasRcvInBuffer (pThis=0x7f49f403a4b0) at nsd_gtls.h:84
> #1  0x00007f49fd770e6f in Add (pNsdsel=0x7f49f42ec1e0,
> pNsd=0x7f49f403a4b0, waitOp=NSDSEL_RD) at nsdsel_gtls.c:75
> #2  0x00007f49fdb7f536 in Add (pThis=0x7f49f4285be0, pStrm=0x7f49f403a1c0,
> waitOp=NSDSEL_RD) at nssel.c:142
> #3  0x00007f49fd977ea5 in RunSelect (pThis=0x156f5e0,
> workset=0x7f49fb25d570, sizeWorkset=128) at tcpsrv.c:758
> #4  0x00007f49fd97829f in Run (pThis=0x156f5e0) at tcpsrv.c:868
> #5  0x00007f49fdd84351 in runInput (pThrd=0x159de90) at imtcp.c:584
> #6  0x00000000004638b6 in thrdStarter (arg=0x159de90) at ../threads.c:212
> #7  0x00007f49ffc2ce9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #8  0x00007f49fef3c3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #9  0x0000000000000000 in ?? ()
>
> Thanks! Still no debug trace, but this should be pretty enlightening at
> this point?

6291.841035907:7f57465c1700: GnuTLS requested retry of 2 operation -
executing
6291.841038470:7f57465c1700: retrying gtls recv, nsd: 0x7f5740268240
6291.841043219:7f57465c1700: GnuTLS receive requires a retry (this most
probably is OK and no error condition)
6291.841046425:7f57465c1700: gtlsRecordRecv return. nsd 0x7f5740268240,
iRet -2100, lenRcvd -28, lenRcvBuf -1, ptrRcvBuf 2269
6291.841049194:7f57465c1700: XXXXXX: doRetry: iRet -2100, pNsd->bAbortConn
0
6291.841051917:7f57465c1700: tcpsrv: ready to process 1 event entries
6291.841054615:7f57465c1700: tcpsrv: processing item 141, pUsr
0x7f574024b730, bAbortConn
6291.841057347:7f57465c1700: netstream 0x7f574024b660 with new data
6291.841061298:7f57465c1700: GnuTLS receive requires a retry (this most
probably is OK and no error condition)
6291.841064556:7f57465c1700: gtlsRecordRecv return. nsd 0x7f5740268240,
iRet -2100, lenRcvd -28, lenRcvBuf -1, ptrRcvBuf 2269
6291.841067560:7f57465c1700: gtlsRcv return. nsd 0x7f5740268240, iRet
-2100, lenRcvBuf -1, ptrRcvBuf 2269
6291.841072960:7f57465c1700: hasRcvInBuffer on nsd 0x10b7240: pszRcvBuf
(nil), lenRcvBuf 0
6291.841076130:7f57465c1700: hasRcvInBuffer on nsd 0x10c1040: pszRcvBuf
(nil), lenRcvBuf 0
6291.841079174:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740000ff0:
pszRcvBuf 0x7f573c0008c0, lenRcvBuf -1
6291.841082109:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740008af0:
pszRcvBuf 0x7f57280008c0, lenRcvBuf -1
6291.841084987:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740013c70:
pszRcvBuf 0x7f57300008c0, lenRcvBuf -1
6291.841087731:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740014cb0:
pszRcvBuf 0x7f573c007490, lenRcvBuf -1
[....snip, repeats a large number of times lines with changing
nsd/pszRcvBuf...]

6291.841561956:7f57465c1700: nsdsel_gtls: doing dummy select, data present
6291.841564871:7f57465c1700: hasRcvInBuffer on nsd 0x10b7240: pszRcvBuf
(nil), lenRcvBuf 0
6291.841567257:7f57465c1700: nsd_gtls: dummy read, buffer not available
for this FD
6291.841570053:7f57465c1700: hasRcvInBuffer on nsd 0x10c1040: pszRcvBuf
(nil), lenRcvBuf 0
6291.841572489:7f57465c1700: nsd_gtls: dummy read, buffer not available
for this FD
6291.841578650:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740000ff0:
pszRcvBuf 0x7f573c0008c0, lenRcvBuf -1
6291.841581387:7f57465c1700: nsd_gtls: dummy read, buffer not available
for this FD
6291.841584133:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740008af0:
pszRcvBuf 0x7f57280008c0, lenRcvBuf -1
6291.841586528:7f57465c1700: nsd_gtls: dummy read, buffer not available
for this FD
6291.841589417:7f57465c1700: hasRcvInBuffer on nsd 0x7f5740013c70:
pszRcvBuf 0x7f57300008c0, lenRcvBuf -1
[... last two lines repeat with different nsd,pszRcvBuf values for a
while...]

It made about a million lines of this in a few seconds. I don't see
anything else in the debug log.

This time it took ~6 days to crash again. What else can I do to help?

thanks!
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of 
sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE 
THAT.

Reply via email to