On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> Date: Tue, 25 Aug 2009 16:44:26 +0200
> From: Rainer Gerhards <[email protected]>
> Reply-To: rsyslog-users <[email protected]>
> To: rsyslog-users <[email protected]>
> Subject: Re: [rsyslog] abort in 4.2.1
> 
> Ok that is good info. I'll still stand by for the debug log, but if that 
> doesn't show anything I'll probably look into crafting some small tools 
> to create a similar environment. Do the malformed messages themselves come 
> in bursts (potentially without well-formed ones in between)?

the ones from the cron job definitely come in bursts, but even after I had 
them modify that script to make those messages well-formed I still had it 
die (at the moment I have had them revert that script to assist in this 
debugging)
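
A minimal sketch of the kind of small test tool discussed above, for anyone who wants to try reproducing the bursts in a lab. It sends a run of UDP datagrams back to back with no well-formed messages in between. The payload only copies the shape of the sanitized message in the debug log; what exactly makes the real messages malformed is not stated in this thread, so the payload is an assumption, and the names `make_message`, `build_burst` and `send_burst` are hypothetical.

```python
import socket


def make_message(seq):
    """Build one test payload: a <PRI> followed directly by a tag.

    Assumption: this mimics the sanitized excerpt
    '<5>iaalog[143336]: AIB|...'; the real messages may differ.
    """
    return f"<5>iaalog[{seq}]: AIB|AAAAA|test burst|N/A".encode("ascii")


def build_burst(count):
    """Return `count` payloads, numbered so drops can be spotted."""
    return [make_message(seq) for seq in range(count)]


def send_burst(host, port, count):
    """Send the whole burst over UDP as fast as the socket accepts it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for payload in build_burst(count):
            sock.sendto(payload, (host, port))
    return count
```

Calling something like send_burst("192.0.2.10", 514, 1000) once a minute from several machines would roughly imitate the cron job traffic; the address here is a documentation placeholder, not a host from this thread.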

here is the tail of the debug log (with the messages themselves lightly 
sanitized)

note that the debug log was _very_ large

-rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug

like the prior debugs, this dies on one of the malformed messages

9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg 
'<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa 
challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge 
Presented|None|N/A|N/A|N/A'
9570.652794351:418d6950: Message has legacy syslog format.
9570.652803191:418d6950: Called action, logging to builtin-file
9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 
0xc87970, state 0
9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx
9570.652836228:418d6950: entering actionCalldoAction(), state: itx
9570.652845667:418d6950: file to log to: /var/log/messages
9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174
9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174
9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes
9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy
9570.652893624:418d6950: action call returned 0
9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
9570.652909382:418d6950: XXXX: submitBatch got state 0
9570.652917182:418d6950: XXXX: submitBatch got state 0
9570.652924941:418d6950: XXXX: submitBatch pre while state 0
9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
9570.652941060:418d6950: XXXX: qAddDirect returns 0
9570.652948899:418d6950: XXXX: queueEnqObj returns  0
9570.652956699:418d6950: XXXX: queueEnqObj returned 0
9570.652964498:418d6950: XXXX: processMsgDoActions returns 0
9570.652972338:418d6950: XXXX: rule.processMsg returns 0
9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0
9570.652988096:418d6950: Called action, logging to builtin-fwd
9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0
9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 
0xc87970, state 0
9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx
9570.653021014:418d6950: entering actionCalldoAction(), state: itx
9570.653030533:418d6950:  192.168.210.8:514/udp
9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy
9570.653054811:418d6950: action call returned 0
9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0
9570.653071050:418d6950: XXXX: submitBatch got state 0
9570.653079010:418d6950: XXXX: submitBatch got state 0
9570.653087009:418d6950: XXXX: submitBatch pre while state 0
9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0
9570.653104368:418d6950: XXXX: qAddDirect returns 0
9570.653112367:418d6950: XXXX: queueEnqObj returns  0
9570.653120446:418d6950: XXXX: queueEnqObj returned 0
9570.653128446:418d6950: XXXX: processMsgDoActions returns 0
9570.653136525:418d6950: XXXX: rule.processMsg returns 0
9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0
9570.653152484:418d6950: XXXX: processMsg got return state 0
9570.653160723:418d6950: msgConsumer processes msg 28/32
9570.653168803:418d6950: dropped NUL at very end of message
9570.653352789:430d9950: 
recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 
17:17:07|account summary|XXXXXXXXX

9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 entries
9570.653386266:430d9950: XXXX: queueEnqObj returns  0
9570.653394706:430d9950: main Q: EnqueueMsg advised worker start
9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514).
9570.653416024:430d9950: --------imUDP calling select, active file descriptors 
(max 4): 4
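
Since the debug log is about 2 GB, a small streaming helper can make it easier to see what each thread was doing just before the abort. The sketch below assumes every entry starts with "<seconds>.<nanoseconds>:<thread id>: " as in the excerpt above, and that lines without that prefix are wrapped continuations of the previous entry. The function name `last_entries_per_thread` is hypothetical and not part of rsyslog.

```python
import re
from collections import defaultdict, deque

# Assumed entry prefix, taken from the excerpt: "9570.652786352:418d6950: text"
LINE_RE = re.compile(r"^(\d+\.\d+):([0-9a-f]+): ?(.*)$")


def last_entries_per_thread(lines, keep=5):
    """Stream over log lines and keep the last `keep` entries of each thread."""
    tails = defaultdict(lambda: deque(maxlen=keep))
    current = None  # most recent entry, so wrapped lines can be appended to it
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            timestamp, thread_id, text = match.groups()
            current = [timestamp, text]
            tails[thread_id].append(current)
        elif current is not None and line.strip():
            # No prefix: treat the line as a continuation of the previous entry.
            current[1] = (current[1] + " " + line.strip()).strip()
    return {tid: [tuple(entry) for entry in q] for tid, q in tails.items()}
```

Passing an open file object (for example open("rsyslog.debug", errors="replace")) keeps memory use small, because only the last few entries per thread are held at any time.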

> rainer
>
> ----- Original Message -----
> From: "[email protected]" <[email protected]>
> To: "rsyslog-users" <[email protected]>
> Sent: 25.08.09 16:20
> Subject: Re: [rsyslog] abort in 4.2.1
>
> On Tue, 25 Aug 2009, Rainer Gerhards wrote:
>
>> On Mon, 2009-08-24 at 14:06 -0700, [email protected] wrote:
>>>> I'm testing to see if it has the problem I reported with 4.2.1 where it 
>>>> dies
>>>> under load from malformed messages.
>>>
>>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may
>>> just be because the race window that causes the crash is smaller). 5.x is
>>> _significantly_ more efficient than 4.x: processing ~1800 messages/sec,
>>> writing them locally and relaying them to another machine eats up <2% cpu
>>> according to top.
>>>
>>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu)
>>
>> The bad thing about debug mode is that it is not only slower, but it also
>> introduces some synchronization. So race bugs frequently disappear when
>> debug mode is turned on. Anyhow, sometimes they persist and then the
>> debug log often provides good information (aka "definitely worth a
>> try" ;)).
>>
>> I did some basic testing with the malformed message you provided in an
>> earlier message, but I unfortunately did not see anything that is not
>> clean. I still tend to assume that the malformedness of the
>> message is not a necessary condition for the segfault - but that needs
>> to be seen. No abort happened (yet) in my lab.
>
> I did finally get it to die, as soon as I get into the office I'll look at 
> the end of the debug log
>
> the box I am duplicating this problem on relays all the logs it receives 
> up to another central box. the logs that come through this box are about a 
> tenth of the total logs that the central box gets, and that central box 
> has had no problems.
>
> the things that I see as being different are
>
> 1. the central box doesn't see the malformed messages (one of the relay 
> boxes would fix that before forwarding it)
>
> 2. there are fewer systems sending simultaneously to the central box 
> (there are ~100 boxes sending to the relay that dies, but only a half 
> dozen relay boxes sending to the central box)
>
> two of the other relays handle a _far_ higher rate of logs, but from fewer 
> sources (one has one source that spews ~15G of logs/day, the other 
> receives ~100m logs/day from 6 machines). a third relay has more machines 
> sending it logs, but at a lower rate than those two (but still 
> significantly higher than the one that fails). if there was a problem with 
> load or the number of messages being received simultaneously I would 
> expect one of these other three to have more problems than the one that 
> fails on me.
>
> 3. a noticeable fraction of the logs sent through this relay box are sent 
> by a cron job running on each of ~60 machines that wakes up every minute and 
> scrapes a local file, sending all the pending messages, so the incoming 
> messages are a bit burstier than normal. the relaying is still bursty, but 
> it is only one bursty box, not many
>
> note that even if this cron job is stopped I still had 4.2.1 die on this 
> relay box, so I don't think that it's the bursty nature of the traffic
>
> this is why I'm suspicious of the malformed message handling
>
> David Lang
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com
