Excellent! This gives me sometging to work with. I could well envision that there is some quirck at that location - will do code review...
rainer ----- Ursprüngliche Nachricht ----- Von: "[email protected]" <[email protected]> An: "rsyslog-users" <[email protected]> Gesendet: 25.08.09 23:57 Betreff: Re: [rsyslog] abort in 4.2.1 On Tue, 25 Aug 2009, Rainer Gerhards wrote: > First shot at it: > > 1. Make sure core dump is written (ulimit -c 999999999) > 2. Have it abort > 3. bdb /path/to/binary/tsyslogd > 4. Core name-of-corefile (usually /core.SOMENBR) > 5. Enter: bt (for backtrace) > 6. Enter: info thread (displays threads) > 7. For each thread: > 7a. Thread number > 7b. Bt > 8. You are done (ctl-d) Core was generated by `rsyslogd -c5 -x'. Program terminated with signal 11, Segmentation fault. [New process 11534] [New process 11538] [New process 11535] [New process 11537] [New process 11533] [New process 11536] #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 222 if(pszMsg[iSrc] == '\0') { /* guard against \0 characters... */ (gdb) bt #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 #1 0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at parser.c:260 #2 0x000000000040b6fc in msgConsumer (notNeeded=<value optimized out>, pBatch=0xa2af98) at syslogd.c:942 #3 0x000000000042df9e in ConsumerReg (pThis=0xa30b00, pWti=0xa2af70) at queue.c:1818 #4 0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276 #5 0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349 #6 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 #7 0x00007f313de545ad in clone () from /lib/libc.so.6 #8 0x0000000000000000 in ?? () (gdb) info threads 6 process 11536 0x00007f313de4dce2 in select () from /lib/libc.so.6 5 process 11533 0x00007f313de4dce2 in select () from /lib/libc.so.6 4 process 11537 0x00007f313e4f27db in read () from /lib/libpthread.so.0 3 process 11535 0x00007f313de4dce2 in select () from /lib/libc.so.6 2 process 11538 0x00007f313de4dce2 in select () from /lib/libc.so.6 * 1 process 11534 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 (gdb) thread 1 [Switching to thread 1 (process 11534)]#0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 222 if(pszMsg[iSrc] == '\0') { /* guard against \0 characters... */ (gdb) bt #0 sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222 #1 0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at parser.c:260 #2 0x000000000040b6fc in msgConsumer (notNeeded=<value optimized out>, pBatch=0xa2af98) at syslogd.c:942 #3 0x000000000042df9e in ConsumerReg (pThis=0xa30b00, pWti=0xa2af70) at queue.c:1818 #4 0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276 #5 0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349 #6 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 #7 0x00007f313de545ad in clone () from /lib/libc.so.6 #8 0x0000000000000000 in ?? () (gdb) thread 2 [Switching to thread 2 (process 11538)]#0 0x00007f313de4dce2 in select () from /lib/libc.so.6 (gdb) bt #0 0x00007f313de4dce2 in select () from /lib/libc.so.6 #1 0x00007f313d1673b3 in ?? () from /usr/local/lib/rsyslog/imudp.so #2 0x000000000043407d in thrdStarter (arg=0x7f312c000dd0) at ../threads.c:157 #3 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 #4 0x00007f313de545ad in clone () from /lib/libc.so.6 #5 0x0000000000000000 in ?? () (gdb) thread 3 [Switching to thread 3 (process 11535)]#0 0x00007f313de4dce2 in select () from /lib/libc.so.6 (gdb) bt #0 0x00007f313de4dce2 in select () from /lib/libc.so.6 #1 0x0000000000433f0a in thrdSleep (pThis=0x7f312c0008c0, iSeconds=<value optimized out>, iuSeconds=<value optimized out>) at ../threads.c:230 #2 0x00007f313d7739a3 in ?? () from /usr/local/lib/rsyslog/immark.so #3 0x000000000043407d in thrdStarter (arg=0x7f312c0008c0) at ../threads.c:157 #4 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 #5 0x00007f313de545ad in clone () from /lib/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 4 [Switching to thread 4 (process 11537)]#0 0x00007f313e4f27db in read () from /lib/libpthread.so.0 (gdb) bt #0 0x00007f313e4f27db in read () from /lib/libpthread.so.0 #1 0x00007f313d36bdc7 in klogLogKMsg () from /usr/local/lib/rsyslog/imklog.so #2 0x00007f313d36b29c in ?? () from /usr/local/lib/rsyslog/imklog.so #3 0x000000000043407d in thrdStarter (arg=0x7f312c000c20) at ../threads.c:157 #4 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 #5 0x00007f313de545ad in clone () from /lib/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 5 [Switching to thread 5 (process 11533)]#0 0x00007f313de4dce2 in select () from /lib/libc.so.6 (gdb) bt #0 0x00007f313de4dce2 in select () from /lib/libc.so.6 #1 0x000000000040d55a in mainThread () at syslogd.c:2520 #2 0x000000000040ec1d in realMain (argc=<value optimized out>, argv=0x0) at syslogd.c:3436 #3 0x00007f313dda31a6 in __libc_start_main () from /lib/libc.so.6 #4 0x000000000040ab49 in _start () (gdb) thread 6 [Switching to thread 6 (process 11536)]#0 0x00007f313de4dce2 in select () from /lib/libc.so.6 (gdb) bt #0 0x00007f313de4dce2 in select () from /lib/libc.so.6 #1 0x00007f313d5716f0 in ?? () from /usr/local/lib/rsyslog/imuxsock.so #2 0x000000000043407d in thrdStarter (arg=0x7f312c000a70) at ../threads.c:157 #3 0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0 #4 0x00007f313de545ad in clone () from /lib/libc.so.6 #5 0x0000000000000000 in ?? () > Step 7 is necessary because the default bt does not necessarily point to the > abort thread (some times it does, some times not...) > > rainer > > ----- Urspr?ngliche Nachricht ----- > Von: "[email protected]" <[email protected]> > An: "rsyslog-users" <[email protected]> > Gesendet: 25.08.09 17:59 > Betreff: Re: [rsyslog] abort in 4.2.1 > > On Tue, 25 Aug 2009, Rainer Gerhards wrote: > >> Mmhhh... Unfortunately, this does not show anything immediately obvious. >> Could you provide me with a gdb backtrace of the abort? Knowing where it >> aborted often helps... > > I don't know how to do this. > > David Lang > >> rainer >> >> ----- Urspr?ngliche Nachricht ----- >> Von: "[email protected]" <[email protected]> >> An: "rsyslog-users" <[email protected]> >> Gesendet: 25.08.09 17:16 >> Betreff: Re: [rsyslog] abort in 4.2.1 >> >> On Tue, 25 Aug 2009, Rainer Gerhards wrote: >> >>> Date: Tue, 25 Aug 2009 16:44:26 +0200 >>> From: Rainer Gerhards <[email protected]> >>> Reply-To: rsyslog-users <[email protected]> >>> To: rsyslog-users <[email protected]> >>> Subject: Re: [rsyslog] abort in 4.2.1 >>> >>> Ok that is good info. I'll still standby for the debug log, but if that >>> doesn't show anything I'll probably look into crafting some small tools >>> to create a similiar environment. Do the malformed messages theselv come >>> in in burts (potentially without wellformed in between)? >> >> the ones from the cron job definantly come in bursts, but even after I had >> them modify that script to make those messages well-formed I still had it >> die (at the moment I had them revert that script to assist in this >> debugging >> >> here is the tail of the debug log (with the messages themselves lightly >> sanitized) >> >> note that the debug log was _very_ large >> >> -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug >> >> like the prior debugs, this dies on one of the malformed messages >> >> 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg >> '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa >> challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge >> Presented|None|N/A|N/A|N/A' >> 9570.652794351:418d6950: Message has legacy syslog format. >> 9570.652803191:418d6950: Called action, logging to builtin-file >> 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 >> 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: >> 0xc87970, state 0 >> 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx >> 9570.652836228:418d6950: entering actionCalldoAction(), state: itx >> 9570.652845667:418d6950: file to log to: /var/log/messages >> 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174 >> 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174 >> 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes >> 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy >> 9570.652893624:418d6950: action call returned 0 >> 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 >> 9570.652909382:418d6950: XXXX: submitBatch got state 0 >> 9570.652917182:418d6950: XXXX: submitBatch got state 0 >> 9570.652924941:418d6950: XXXX: submitBatch pre while state 0 >> 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 >> 9570.652941060:418d6950: XXXX: qAddDirect returns 0 >> 9570.652948899:418d6950: XXXX: queueEnqObj returns 0 >> 9570.652956699:418d6950: XXXX: queueEnqObj returned 0 >> 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0 >> 9570.652972338:418d6950: XXXX: rule.processMsg returns 0 >> 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0 >> 9570.652988096:418d6950: Called action, logging to builtin-fwd >> 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 >> 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: >> 0xc87970, state 0 >> 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx >> 9570.653021014:418d6950: entering actionCalldoAction(), state: itx >> 9570.653030533:418d6950: 192.168.210.8:514/udp >> 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy >> 9570.653054811:418d6950: action call returned 0 >> 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 >> 9570.653071050:418d6950: XXXX: submitBatch got state 0 >> 9570.653079010:418d6950: XXXX: submitBatch got state 0 >> 9570.653087009:418d6950: XXXX: submitBatch pre while state 0 >> 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 >> 9570.653104368:418d6950: XXXX: qAddDirect returns 0 >> 9570.653112367:418d6950: XXXX: queueEnqObj returns 0 >> 9570.653120446:418d6950: XXXX: queueEnqObj returned 0 >> 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0 >> 9570.653136525:418d6950: XXXX: rule.processMsg returns 0 >> 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0 >> 9570.653152484:418d6950: XXXX: processMsg got return state 0 >> 9570.653160723:418d6950: msgConsumer processes msg 28/32 >> 9570.653168803:418d6950: dropped NUL at very end of message >> 9570.653352789:430d9950: >> recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 >> 17:17:07|account summary|XXXXXXXXX >> >> 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 >> entries >> 9570.653386266:430d9950: XXXX: queueEnqObj returns 0 >> 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start >> 9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514). >> 9570.653416024:430d9950: --------imUDP calling select, active file >> descriptors (max 4): 4 >> >>> rainer >>> >>> ----- Urspr?ngliche Nachricht ----- >>> Von: "[email protected]" <[email protected]> >>> An: "rsyslog-users" <[email protected]> >>> Gesendet: 25.08.09 16:20 >>> Betreff: Re: [rsyslog] abort in 4.2.1 >>> >>> On Tue, 25 Aug 2009, Rainer Gerhards wrote: >>> >>>> On Mon, 2009-08-24 at 14:06 -0700, [email protected] wrote: >>>>>> I'm testing to see if it has the problem I reported with 4.2.1 where it >>>>>> dies >>>>>> under load from malformed messages. >>>>> >>>>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may >>>>> just be that the race condition to cause the crash is smaller, 5.x is >>>>> _significantly_ more efficiant than 4.x is. processing ~1800 messages/sec, >>>>> writing them locally and relaying them to another machine eats up <2% cpu >>>>> according to top) >>>>> >>>>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu) >>>> >>>> The bad thing about debug mode is that not only it is slower, but it >>>> introduces some synchronization. So race bugs frequently disappear when >>>> debug mode is turned on. Anyhow, sometimes they persist and then the >>>> debug log often provides good information (aka "definitely worth a >>>> try" ;)). >>>> >>>> I did some basic testing with the malformed message you provided in an >>>> earlier message, but I unfortunately did not see anything that is not >>>> clean. I am still a bit of the assumption that the malformednes of the >>>> message is not a necessary condition for the segfault - but that needs >>>> to be seen. No abort happened (yet) in my lab. >>> >>> I did finally get it to die, as soon as I get into the office I'll look at >>> the end of the debug log >>> >>> the box I am duplicating this problem on relays all the logs it recieves >>> up to another central box. the logs that come through this box are about a >>> tenth of the total logs that the central box gets, and that central box >>> has had no problems. >>> >>> the things that I see as being different are >>> >>> 1. the central box doesn't see the malformed messages (one of the relay >>> boxes would fix that before forwarding it) >>> >>> 2. there are fewer systems sending simultaniously to the central box >>> (there are ~100 boxes sending to the relay that dies, but only a half >>> dozen relay boxes sending to the central box) >>> >>> two of the other relays handle a _far_ higher rate of logs, but from fewer >>> sources (one has one source that spews ~15G of logs/day, the other >>> recieves ~100m logs/day from 6 machines). a third relay has more machines >>> sending it logs, but at a lower rate than those two (but still >>> significantly higher than the one that fails). if there was a problem with >>> load or the number of messages being recieved simultaniously I would >>> expect one of these other three to have more problems than the one that >>> fails on me. >>> >>> 3. a noticable fraction of the logs sent through this relay box are sent >>> by a cron job running on each of ~60 machines that wakes up every min and >>> scrapes a local file, sending all the pending messages, so the incoming >>> messages are a bit burstier than normal, the relaying is still bursty, but >>> it is only one bursty box, not many >>> >>> note that even if this cron job is stopped I still had 4.2.1 die on this >>> relay box, so I don't think that it's the bursty nature of the traffic >>> >>> this is why I'm suspicious of the malformed message handling >>> >>> David Lang >>> _______________________________________________ >>> rsyslog mailing list >>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>> http://www.rsyslog.com >>> _______________________________________________ >>> rsyslog mailing list >>> http://lists.adiscon.net/mailman/listinfo/rsyslog >>> http://www.rsyslog.com >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com >> _______________________________________________ >> rsyslog mailing list >> http://lists.adiscon.net/mailman/listinfo/rsyslog >> http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com

