Re: [rsyslog] abort in 4.2.1

Rainer Gerhards Tue, 25 Aug 2009 23:18:11 -0700

Excellent! This gives me something to work with. I could well envision that 
there is some quirk at that location - will do code review...


rainer

----- Original Message -----
From: "[email protected]" <[email protected]>
To: "rsyslog-users" <[email protected]>
Sent: 25.08.09 23:57
Subject: Re: [rsyslog] abort in 4.2.1

On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> First shot at it:
>
> 1. Make sure core dump is written (ulimit -c 999999999)
> 2. Have it abort
> 3. gdb /path/to/binary/rsyslogd
> 4. core name-of-corefile (usually /core.SOMENBR)
> 5. Enter: bt (for backtrace)
> 6. Enter: info threads (displays threads)
> 7. For each thread:
> 7a. thread number
> 7b. bt
> 8. You are done (ctl-d)
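
The interactive steps above can also be scripted. The following is a minimal sketch in Python, assuming gdb is on the PATH and using its standard -batch and -ex options; the function names and the paths in the usage comment are placeholders, not anything shipped with rsyslog.

```python
import subprocess


def gdb_batch_command(binary, core):
    """Build a gdb command line that prints all backtraces and then exits."""
    return [
        "gdb", "-batch",
        "-ex", "bt",                   # backtrace of the thread gdb selects by default
        "-ex", "info threads",         # list all threads in the core
        "-ex", "thread apply all bt",  # backtrace of every thread in turn
        binary, core,
    ]


def collect_backtraces(binary, core):
    """Run gdb against the core file and return its output as text."""
    result = subprocess.run(
        gdb_batch_command(binary, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # gdb may exit non-zero; the output is still worth reading
    )
    return result.stdout


# Hypothetical usage (paths are placeholders):
# print(collect_backtraces("/path/to/binary/rsyslogd", "/core.SOMENBR"))
```

Because "thread apply all bt" walks every thread, it covers step 7 without having to switch threads by hand.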

Core was generated by `rsyslogd -c5 -x'.
Program terminated with signal 11, Segmentation fault.
[New process 11534]
[New process 11538]
[New process 11535]
[New process 11537]
[New process 11533]
[New process 11536]
#0  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
222                     if(pszMsg[iSrc] == '\0') { /* guard against \0 characters... */
(gdb) bt
#0  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
#1  0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at parser.c:260
#2  0x000000000040b6fc in msgConsumer (notNeeded=<value optimized out>, pBatch=0xa2af98) at syslogd.c:942
#3  0x000000000042df9e in ConsumerReg (pThis=0xa30b00, pWti=0xa2af70) at queue.c:1818
#4  0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276
#5  0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349
#6  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#7  0x00007f313de545ad in clone () from /lib/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) info threads
   6 process 11536  0x00007f313de4dce2 in select () from /lib/libc.so.6
   5 process 11533  0x00007f313de4dce2 in select () from /lib/libc.so.6
   4 process 11537  0x00007f313e4f27db in read () from /lib/libpthread.so.0
   3 process 11535  0x00007f313de4dce2 in select () from /lib/libc.so.6
   2 process 11538  0x00007f313de4dce2 in select () from /lib/libc.so.6
* 1 process 11534  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
(gdb) thread 1
[Switching to thread 1 (process 11534)]#0  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
222                     if(pszMsg[iSrc] == '\0') { /* guard against \0 characters... */
(gdb) bt
#0  sanitizeMessage (pMsg=0x7f312c001530) at parser.c:222
#1  0x00000000004161f1 in parseMsg (pMsg=0x7f312c001530) at parser.c:260
#2  0x000000000040b6fc in msgConsumer (notNeeded=<value optimized out>, pBatch=0xa2af98) at syslogd.c:942
#3  0x000000000042df9e in ConsumerReg (pThis=0xa30b00, pWti=0xa2af70) at queue.c:1818
#4  0x0000000000428220 in wtiWorker (pThis=0xa2af70) at wti.c:276
#5  0x00000000004279ac in wtpWorker (arg=0xa2af70) at wtp.c:349
#6  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#7  0x00007f313de545ad in clone () from /lib/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (process 11538)]#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
(gdb) bt
#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
#1  0x00007f313d1673b3 in ?? () from /usr/local/lib/rsyslog/imudp.so
#2  0x000000000043407d in thrdStarter (arg=0x7f312c000dd0) at ../threads.c:157
#3  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#4  0x00007f313de545ad in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (process 11535)]#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
(gdb) bt
#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
#1  0x0000000000433f0a in thrdSleep (pThis=0x7f312c0008c0, iSeconds=<value optimized out>, iuSeconds=<value optimized out>) at ../threads.c:230
#2  0x00007f313d7739a3 in ?? () from /usr/local/lib/rsyslog/immark.so
#3  0x000000000043407d in thrdStarter (arg=0x7f312c0008c0) at ../threads.c:157
#4  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#5  0x00007f313de545ad in clone () from /lib/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (process 11537)]#0  0x00007f313e4f27db in read () from /lib/libpthread.so.0
(gdb) bt
#0  0x00007f313e4f27db in read () from /lib/libpthread.so.0
#1  0x00007f313d36bdc7 in klogLogKMsg () from /usr/local/lib/rsyslog/imklog.so
#2  0x00007f313d36b29c in ?? () from /usr/local/lib/rsyslog/imklog.so
#3  0x000000000043407d in thrdStarter (arg=0x7f312c000c20) at ../threads.c:157
#4  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#5  0x00007f313de545ad in clone () from /lib/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (process 11533)]#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
(gdb) bt
#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
#1  0x000000000040d55a in mainThread () at syslogd.c:2520
#2  0x000000000040ec1d in realMain (argc=<value optimized out>, argv=0x0) at syslogd.c:3436
#3  0x00007f313dda31a6 in __libc_start_main () from /lib/libc.so.6
#4  0x000000000040ab49 in _start ()
(gdb) thread 6
[Switching to thread 6 (process 11536)]#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
(gdb) bt
#0  0x00007f313de4dce2 in select () from /lib/libc.so.6
#1  0x00007f313d5716f0 in ?? () from /usr/local/lib/rsyslog/imuxsock.so
#2  0x000000000043407d in thrdStarter (arg=0x7f312c000a70) at ../threads.c:157
#3  0x00007f313e4ebfc7 in start_thread () from /lib/libpthread.so.0
#4  0x00007f313de545ad in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()





> Step 7 is necessary because the default bt does not necessarily point to the 
> abort thread (sometimes it does, sometimes not...)
>
> rainer
>
> ----- Original Message -----
> From: "[email protected]" <[email protected]>
> To: "rsyslog-users" <[email protected]>
> Sent: 25.08.09 17:59
> Subject: Re: [rsyslog] abort in 4.2.1
>
> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>
>> Mmhhh... Unfortunately, this does not show anything immediately obvious. 
>> Could you provide me with a gdb backtrace of the abort? Knowing where it 
>> aborted often helps...
>
> I don't know how to do this.
>
> David Lang
>
>> rainer
>>
>> ----- Original Message -----
>> From: "[email protected]" <[email protected]>
>> To: "rsyslog-users" <[email protected]>
>> Sent: 25.08.09 17:16
>> Subject: Re: [rsyslog] abort in 4.2.1
>>
>> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>>
>>> Date: Tue, 25 Aug 2009 16:44:26 +0200
>>> From: Rainer Gerhards <[email protected]>
>>> Reply-To: rsyslog-users <[email protected]>
>>> To: rsyslog-users <[email protected]>
>>> Subject: Re: [rsyslog] abort in 4.2.1
>>> 
>>> Ok that is good info. I'll still stand by for the debug log, but if that 
>>> doesn't show anything I'll probably look into crafting some small tools 
>>> to create a similar environment. Do the malformed messages themselves come 
>>> in bursts (potentially without well-formed ones in between)?
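
A small tool of the kind mentioned above could start from something like the following. It is only a sketch in Python, not an rsyslog test tool: send_burst and SAMPLE_MESSAGE are hypothetical names, the payload is modeled on the sanitized message quoted in this thread, and the host and port are whatever the UDP listener under test uses.

```python
import socket

# Payload modeled on the sanitized message in this thread: a priority and tag,
# but no timestamp or hostname in front of the tag.
SAMPLE_MESSAGE = b"<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa challenge"


def send_burst(host, port, payload, count):
    """Send `count` copies of `payload` as UDP datagrams, back to back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            sock.sendto(payload, (host, port))
    return count


# Hypothetical usage against a lab listener on the default syslog UDP port:
# send_burst("127.0.0.1", 514, SAMPLE_MESSAGE, 1000)
```

Running several such senders in parallel would come closer to many machines reporting at once, but whether that is enough to trigger the abort is not known from this thread.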
>>
>> the ones from the cron job definitely come in bursts, but even after I had 
>> them modify that script to make those messages well-formed I still had it 
>> die (at the moment I had them revert that script to assist in this 
>> debugging)
>>
>> here is the tail of the debug log (with the messages themselves lightly 
>> sanitized)
>>
>> note that the debug log was _very_ large
>>
>> -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug
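
With a debug log of roughly 2 GB, reading only the end of the file is much cheaper than loading all of it. A minimal sketch in Python, assuming the log is plain text with newline-separated lines; tail_lines and its parameters are illustrative names only.

```python
import os


def tail_lines(path, max_lines=50, chunk_size=64 * 1024):
    """Return the last `max_lines` lines of a file, reading backwards in chunks."""
    if max_lines <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Keep prepending chunks until more than max_lines newlines are buffered,
        # so the oldest line that gets returned is guaranteed to be complete.
        while position > 0 and data.count(b"\n") <= max_lines:
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
    lines = data.splitlines()
    return [line.decode("utf-8", errors="replace") for line in lines[-max_lines:]]


# Hypothetical usage:
# for line in tail_lines("rsyslog.debug", 100):
#     print(line)
```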
>>
>> like the prior debugs, this dies on one of the malformed messages
>>
>> 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg 
>> '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa 
>> challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge 
>> Presented|None|N/A|N/A|N/A'
>> 9570.652794351:418d6950: Message has legacy syslog format.
>> 9570.652803191:418d6950: Called action, logging to builtin-file
>> 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
>> 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 
>> 0xc87970, state 0
>> 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx
>> 9570.652836228:418d6950: entering actionCalldoAction(), state: itx
>> 9570.652845667:418d6950: file to log to: /var/log/messages
>> 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174
>> 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174
>> 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes
>> 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy
>> 9570.652893624:418d6950: action call returned 0
>> 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
>> 9570.652909382:418d6950: XXXX: submitBatch got state 0
>> 9570.652917182:418d6950: XXXX: submitBatch got state 0
>> 9570.652924941:418d6950: XXXX: submitBatch pre while state 0
>> 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
>> 9570.652941060:418d6950: XXXX: qAddDirect returns 0
>> 9570.652948899:418d6950: XXXX: queueEnqObj returns  0
>> 9570.652956699:418d6950: XXXX: queueEnqObj returned 0
>> 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0
>> 9570.652972338:418d6950: XXXX: rule.processMsg returns 0
>> 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0
>> 9570.652988096:418d6950: Called action, logging to builtin-fwd
>> 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
>> 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 
>> 0xc87970, state 0
>> 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx
>> 9570.653021014:418d6950: entering actionCalldoAction(), state: itx
>> 9570.653030533:418d6950:  192.168.210.8:514/udp
>> 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy
>> 9570.653054811:418d6950: action call returned 0
>> 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
>> 9570.653071050:418d6950: XXXX: submitBatch got state 0
>> 9570.653079010:418d6950: XXXX: submitBatch got state 0
>> 9570.653087009:418d6950: XXXX: submitBatch pre while state 0
>> 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
>> 9570.653104368:418d6950: XXXX: qAddDirect returns 0
>> 9570.653112367:418d6950: XXXX: queueEnqObj returns  0
>> 9570.653120446:418d6950: XXXX: queueEnqObj returned 0
>> 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0
>> 9570.653136525:418d6950: XXXX: rule.processMsg returns 0
>> 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0
>> 9570.653152484:418d6950: XXXX: processMsg got return state 0
>> 9570.653160723:418d6950: msgConsumer processes msg 28/32
>> 9570.653168803:418d6950: dropped NUL at very end of message
>> 9570.653352789:430d9950: 
>> recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 
>> 17:17:07|account summary|XXXXXXXXX
>>
>> 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 
>> entries
>> 9570.653386266:430d9950: XXXX: queueEnqObj returns  0
>> 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start
>> 9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514).
>> 9570.653416024:430d9950: --------imUDP calling select, active file 
>> descriptors (max 4): 4
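
Each debug line above appears to start with a timestamp followed by a hex field that looks like a thread identifier (418d6950 and 430d9950 in this excerpt). Assuming that reading of the format, which is inferred from the excerpt and not from rsyslog documentation, a short Python sketch can pull out the last entry per thread to see what each one was doing just before the crash. The names DEBUG_LINE and last_entry_per_thread are made up for this example.

```python
import re

# Assumed layout: <seconds.fraction>:<hex thread id>: <text>
DEBUG_LINE = re.compile(r"^(\d+\.\d+):([0-9a-f]+): ?(.*)$")


def last_entry_per_thread(lines):
    """Map each thread id to the (timestamp, text) of its last matching line."""
    last = {}
    for line in lines:
        match = DEBUG_LINE.match(line.rstrip("\n"))
        if match:
            timestamp, thread_id, text = match.groups()
            last[thread_id] = (timestamp, text)
        # Lines that do not match (wrapped continuations, blanks) are skipped.
    return last
```

Applied to the excerpt above, the last entry for 418d6950 would be the "dropped NUL at very end of message" line.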
>>
>>> rainer
>>>
>>> ----- Original Message -----
>>> From: "[email protected]" <[email protected]>
>>> To: "rsyslog-users" <[email protected]>
>>> Sent: 25.08.09 16:20
>>> Subject: Re: [rsyslog] abort in 4.2.1
>>>
>>> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>>>
>>>> On Mon, 2009-08-24 at 14:06 -0700, [email protected] wrote:
>>>>>> I'm testing to see if it has the problem I reported with 4.2.1 where it 
>>>>>> dies
>>>>>> under load from malformed messages.
>>>>>
>>>>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may
>>>>> just be that the race condition to cause the crash is smaller). 5.x is
>>>>> _significantly_ more efficient than 4.x is: processing ~1800 messages/sec,
>>>>> writing them locally and relaying them to another machine eats up <2% cpu
>>>>> according to top.
>>>>>
>>>>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu)
>>>>
>>>> The bad thing about debug mode is that not only is it slower, but it
>>>> also introduces some synchronization. So race bugs frequently disappear
>>>> when debug mode is turned on. Anyhow, sometimes they persist and then the
>>>> debug log often provides good information (aka "definitely worth a
>>>> try" ;)).
>>>>
>>>> I did some basic testing with the malformed message you provided in an
>>>> earlier message, but I unfortunately did not see anything that is not
>>>> clean. I still tend to assume that the malformedness of the
>>>> message is not a necessary condition for the segfault - but that remains
>>>> to be seen. No abort happened (yet) in my lab.
>>>
>>> I did finally get it to die, as soon as I get into the office I'll look at 
>>> the end of the debug log
>>>
>>> the box I am duplicating this problem on relays all the logs it receives 
>>> up to another central box. the logs that come through this box are about a 
>>> tenth of the total logs that the central box gets, and that central box 
>>> has had no problems.
>>>
>>> the things that I see as being different are
>>>
>>> 1. the central box doesn't see the malformed messages (one of the relay 
>>> boxes would fix that before forwarding it)
>>>
>>> 2. there are fewer systems sending simultaneously to the central box 
>>> (there are ~100 boxes sending to the relay that dies, but only a half 
>>> dozen relay boxes sending to the central box)
>>>
>>> two of the other relays handle a _far_ higher rate of logs, but from fewer 
>>> sources (one has one source that spews ~15G of logs/day, the other 
>>> receives ~100m logs/day from 6 machines). a third relay has more machines 
>>> sending it logs, but at a lower rate than those two (but still 
>>> significantly higher than the one that fails). if there was a problem with 
>>> load or the number of messages being received simultaneously I would 
>>> expect one of these other three to have more problems than the one that 
>>> fails on me.
>>>
>>> 3. a noticeable fraction of the logs sent through this relay box are sent 
>>> by a cron job running on each of ~60 machines that wakes up every minute 
>>> and scrapes a local file, sending all the pending messages, so the incoming 
>>> messages are a bit burstier than normal. the relaying is still bursty, but 
>>> it is only one bursty box, not many
>>>
>>> note that even if this cron job is stopped I still had 4.2.1 die on this 
>>> relay box, so I don't think that it's the bursty nature of the traffic
>>>
>>> this is why I'm suspicious of the malformed message handling
>>>
>>> David Lang
>>> _______________________________________________
>>> rsyslog mailing list
>>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>>> http://www.rsyslog.com
