On Tue, 25 Aug 2009, Rainer Gerhards wrote: > Date: Tue, 25 Aug 2009 16:44:26 +0200 > From: Rainer Gerhards <[email protected]> > Reply-To: rsyslog-users <[email protected]> > To: rsyslog-users <[email protected]> > Subject: Re: [rsyslog] abort in 4.2.1 > > Ok that is good info. I'll still standby for the debug log, but if that > doesn't show anything I'll probably look into crafting some small tools > to create a similiar environment. Do the malformed messages theselv come > in in burts (potentially without wellformed in between)?
the ones from the cron job definantly come in bursts, but even after I had them modify that script to make those messages well-formed I still had it die (at the moment I had them revert that script to assist in this debugging here is the tail of the debug log (with the messages themselves lightly sanitized) note that the debug log was _very_ large -rw-r--r-- 1 root root 2010546482 Aug 24 21:32 rsyslog.debug like the prior debugs, this dies on one of the malformed messages 9570.652786352:418d6950: msg parser: flags 30, from '192.168.242.15', msg '<5>iaalog[143336]: AIB|AAAAA|2009/08/24 17:12:48|mfa challenge|XXXXXXXXX|XXX.XX.XX.XXX|Challenge Question(s)|Challenge Presented|None|N/A|N/A|N/A' 9570.652794351:418d6950: Message has legacy syslog format. 9570.652803191:418d6950: Called action, logging to builtin-file 9570.652811270:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 9570.652820109:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0 9570.652828309:418d6950: Action 0xc4e130 transitioned to state: itx 9570.652836228:418d6950: entering actionCalldoAction(), state: itx 9570.652845667:418d6950: file to log to: /var/log/messages 9570.652854067:418d6950: doWrite, pData->pStrm 0xc4f150, lenBuf 174 9570.652862546:418d6950: strm 0xc4f150: file 6 flush, buflen 174 9570.652875305:418d6950: strm 0xc4f150: file 6 write wrote 174 bytes 9570.652885664:418d6950: Action 0xc4e130 transitioned to state: rdy 9570.652893624:418d6950: action call returned 0 9570.652901623:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 9570.652909382:418d6950: XXXX: submitBatch got state 0 9570.652917182:418d6950: XXXX: submitBatch got state 0 9570.652924941:418d6950: XXXX: submitBatch pre while state 0 9570.652932941:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 9570.652941060:418d6950: XXXX: qAddDirect returns 0 9570.652948899:418d6950: XXXX: queueEnqObj returns 0 9570.652956699:418d6950: XXXX: queueEnqObj returned 0 9570.652964498:418d6950: XXXX: processMsgDoActions returns 0 9570.652972338:418d6950: XXXX: rule.processMsg returns 0 9570.652980017:418d6950: XXXX: pcoessMsgDoRules returns 0 9570.652988096:418d6950: Called action, logging to builtin-fwd 9570.652996056:418d6950: XXXX: ENTER tryDoAction elt 0 state 0 9570.653004895:418d6950: submitBatch: i:0, batch size 1, to process 1, pMsg: 0xc87970, state 0 9570.653013055:418d6950: Action 0xc4e680 transitioned to state: itx 9570.653021014:418d6950: entering actionCalldoAction(), state: itx 9570.653030533:418d6950: 192.168.210.8:514/udp 9570.653045972:418d6950: Action 0xc4e680 transitioned to state: rdy 9570.653054811:418d6950: action call returned 0 9570.653063051:418d6950: XXXX: done tryDoAction elt 0 state 0, iret 0 9570.653071050:418d6950: XXXX: submitBatch got state 0 9570.653079010:418d6950: XXXX: submitBatch got state 0 9570.653087009:418d6950: XXXX: submitBatch pre while state 0 9570.653095888:418d6950: XXXX: END submitBatch elt 0 state 0, iRet 0 9570.653104368:418d6950: XXXX: qAddDirect returns 0 9570.653112367:418d6950: XXXX: queueEnqObj returns 0 9570.653120446:418d6950: XXXX: queueEnqObj returned 0 9570.653128446:418d6950: XXXX: processMsgDoActions returns 0 9570.653136525:418d6950: XXXX: rule.processMsg returns 0 9570.653144445:418d6950: XXXX: pcoessMsgDoRules returns 0 9570.653152484:418d6950: XXXX: processMsg got return state 0 9570.653160723:418d6950: msgConsumer processes msg 28/32 9570.653168803:418d6950: dropped NUL at very end of message 9570.653352789:430d9950: recv(4,76)/192.168.242.15,acl:1,msg:<5>iaalog[143336]: AIB|AAAA|2009/08/24 17:17:07|account summary|XXXXXXXXX 9570.653367348:430d9950: main Q: entry added, size now log 186, phys 218 entries 9570.653386266:430d9950: XXXX: queueEnqObj returns 0 9570.653394706:430d9950: main Q: EnqueueMsg advised worker start 9570.653407625:430d9950: Listening on UDP syslogd socket 4 (IPv4/port 514). 9570.653416024:430d9950: --------imUDP calling select, active file descriptors (max 4): 4 > rainer > > ----- Urspr?ngliche Nachricht ----- > Von: "[email protected]" <[email protected]> > An: "rsyslog-users" <[email protected]> > Gesendet: 25.08.09 16:20 > Betreff: Re: [rsyslog] abort in 4.2.1 > > On Tue, 25 Aug 2009, Rainer Gerhards wrote: > >> On Mon, 2009-08-24 at 14:06 -0700, [email protected] wrote: >>>> I'm testing to see if it has the problem I reported with 4.2.1 where it >>>> dies >>>> under load from malformed messages. >>> >>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may >>> just be that the race condition to cause the crash is smaller, 5.x is >>> _significantly_ more efficiant than 4.x is. processing ~1800 messages/sec, >>> writing them locally and relaying them to another machine eats up <2% cpu >>> according to top) >>> >>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu) >> >> The bad thing about debug mode is that not only it is slower, but it >> introduces some synchronization. So race bugs frequently disappear when >> debug mode is turned on. Anyhow, sometimes they persist and then the >> debug log often provides good information (aka "definitely worth a >> try" ;)). >> >> I did some basic testing with the malformed message you provided in an >> earlier message, but I unfortunately did not see anything that is not >> clean. I am still a bit of the assumption that the malformednes of the >> message is not a necessary condition for the segfault - but that needs >> to be seen. No abort happened (yet) in my lab. > > I did finally get it to die, as soon as I get into the office I'll look at > the end of the debug log > > the box I am duplicating this problem on relays all the logs it recieves > up to another central box. the logs that come through this box are about a > tenth of the total logs that the central box gets, and that central box > has had no problems. > > the things that I see as being different are > > 1. the central box doesn't see the malformed messages (one of the relay > boxes would fix that before forwarding it) > > 2. there are fewer systems sending simultaniously to the central box > (there are ~100 boxes sending to the relay that dies, but only a half > dozen relay boxes sending to the central box) > > two of the other relays handle a _far_ higher rate of logs, but from fewer > sources (one has one source that spews ~15G of logs/day, the other > recieves ~100m logs/day from 6 machines). a third relay has more machines > sending it logs, but at a lower rate than those two (but still > significantly higher than the one that fails). if there was a problem with > load or the number of messages being recieved simultaniously I would > expect one of these other three to have more problems than the one that > fails on me. > > 3. a noticable fraction of the logs sent through this relay box are sent > by a cron job running on each of ~60 machines that wakes up every min and > scrapes a local file, sending all the pending messages, so the incoming > messages are a bit burstier than normal, the relaying is still bursty, but > it is only one bursty box, not many > > note that even if this cron job is stopped I still had 4.2.1 die on this > relay box, so I don't think that it's the bursty nature of the traffic > > this is why I'm suspicious of the malformed message handling > > David Lang > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com > _______________________________________________ > rsyslog mailing list > http://lists.adiscon.net/mailman/listinfo/rsyslog > http://www.rsyslog.com _______________________________________________ rsyslog mailing list http://lists.adiscon.net/mailman/listinfo/rsyslog http://www.rsyslog.com

