OK, that is good info. I'll still stand by for the debug log, but if that doesn't show anything I'll probably look into crafting some small tools to create a similar environment. Do the malformed messages themselves come in bursts (potentially without well-formed ones in between)?
rainer

----- Original Message -----
From: "[email protected]" <[email protected]>
To: "rsyslog-users" <[email protected]>
Sent: 25.08.09 16:20
Subject: Re: [rsyslog] abort in 4.2.1

On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> On Mon, 2009-08-24 at 14:06 -0700, [email protected] wrote:
>>> I'm testing to see if it has the problem I reported with 4.2.1 where it dies
>>> under load from malformed messages.
>>
>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may
>> just be that the race condition to cause the crash is smaller; 5.x is
>> _significantly_ more efficient than 4.x is. processing ~1800 messages/sec,
>> writing them locally and relaying them to another machine eats up <2% cpu
>> according to top)
>>
>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu)
>
> The bad thing about debug mode is that not only is it slower, but it
> introduces some synchronization. So race bugs frequently disappear when
> debug mode is turned on. Anyhow, sometimes they persist and then the
> debug log often provides good information (aka "definitely worth a
> try" ;)).
>
> I did some basic testing with the malformed message you provided in an
> earlier message, but I unfortunately did not see anything that is not
> clean. I am still a bit of the assumption that the malformedness of the
> message is not a necessary condition for the segfault - but that needs
> to be seen. No abort happened (yet) in my lab.

I did finally get it to die. as soon as I get into the office I'll look at the end of the debug log.

the box I am duplicating this problem on relays all the logs it receives up to another central box. the logs that come through this box are about a tenth of the total logs that the central box gets, and that central box has had no problems.

the things that I see as being different are

1. the central box doesn't see the malformed messages (one of the relay boxes would fix that before forwarding it)

2. there are fewer systems sending simultaneously to the central box (there are ~100 boxes sending to the relay that dies, but only a half dozen relay boxes sending to the central box)

two of the other relays handle a _far_ higher rate of logs, but from fewer sources (one has one source that spews ~15G of logs/day, the other receives ~100m logs/day from 6 machines). a third relay has more machines sending it logs, but at a lower rate than those two (but still significantly higher than the one that fails). if there was a problem with load or the number of messages being received simultaneously I would expect one of these other three to have more problems than the one that fails on me.

3. a noticeable fraction of the logs sent through this relay box are sent by a cron job running on each of ~60 machines that wakes up every minute and scrapes a local file, sending all the pending messages, so the incoming messages are a bit burstier than normal. the relaying is still bursty, but it is only one bursty box, not many.

note that even if this cron job is stopped I still had 4.2.1 die on this relay box, so I don't think that it's the bursty nature of the traffic.

this is why I'm suspicious of the malformed message handling.

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

