OK, that is good info. I'll still stand by for the debug log, but if that doesn't 
show anything I'll probably look into crafting some small tools to create a 
similar environment. Do the malformed messages themselves come in bursts 
(potentially with no well-formed messages in between)?
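
A small tool of that kind could look roughly like the sketch below. It is only an illustration under stated assumptions: UDP transport and both sample messages are placeholders (the actual malformed message is not reproduced in this mail), and the names build_burst and send_burst are made up for the example.

```python
import random
import socket

# Placeholder samples (assumptions): the real malformed message from the
# earlier mail is not shown in this thread, so these only stand in for it.
WELL_FORMED = b"<13>Aug 25 16:20:00 host1 app: test message"
MALFORMED = b"<13>no timestamp or hostname here"


def build_burst(size, malformed_ratio, rng):
    """Return `size` messages; each one is malformed with probability
    `malformed_ratio` (1.0 gives a burst with no well-formed messages)."""
    return [
        MALFORMED if rng.random() < malformed_ratio else WELL_FORMED
        for _ in range(size)
    ]


def send_burst(messages, host, port):
    """Send each message as its own UDP datagram, back to back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for msg in messages:
            sock.sendto(msg, (host, port))
```

Calling `send_burst(build_burst(1000, 1.0, random.Random(0)), "127.0.0.1", 514)` would fire one all-malformed burst at a test instance assumed to listen on UDP port 514 of the local machine; lowering the ratio mixes well-formed messages in between.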

rainer

----- Original Message -----
From: "[email protected]" <[email protected]>
To: "rsyslog-users" <[email protected]>
Sent: 25.08.09 16:20
Subject: Re: [rsyslog] abort in 4.2.1

On Tue, 25 Aug 2009, Rainer Gerhards wrote:

> On Mon, 2009-08-24 at 14:06 -0700, [email protected] wrote:
>>> I'm testing to see if it has the problem I reported with 4.2.1 where it dies
>>> under load from malformed messages.
>>
>> It finally died just like 4.2.1 did. It took a _lot_ longer (which may
>> just be that the race window that causes the crash is smaller; 5.x is
>> _significantly_ more efficient than 4.x: processing ~1800 messages/sec,
>> writing them locally and relaying them to another machine eats up <2% cpu
>> according to top)
>>
>> I restarted it in debug mode (this takes more cpu, almost 10% of a cpu)
>
> The bad thing about debug mode is that not only it is slower, but it
> introduces some synchronization. So race bugs frequently disappear when
> debug mode is turned on. Anyhow, sometimes they persist and then the
> debug log often provides good information (aka "definitely worth a
> try" ;)).
>
> I did some basic testing with the malformed message you provided in an
> earlier message, but I unfortunately did not see anything that is not
> clean. I still tend to assume that the malformedness of the
> message is not a necessary condition for the segfault - but that remains
> to be seen. No abort happened (yet) in my lab.

I did finally get it to die. As soon as I get into the office I'll look at 
the end of the debug log.

the box I am duplicating this problem on relays all the logs it receives 
up to another central box. the logs that come through this box are about a 
tenth of the total logs that the central box gets, and that central box 
has had no problems.

the things that I see as being different are

1. the central box doesn't see the malformed messages (one of the relay 
boxes would fix that before forwarding it)

2. there are fewer systems sending simultaneously to the central box 
(there are ~100 boxes sending to the relay that dies, but only a half 
dozen relay boxes sending to the central box)

two of the other relays handle a _far_ higher rate of logs, but from fewer 
sources (one has one source that spews ~15G of logs/day, the other 
receives ~100m logs/day from 6 machines). a third relay has more machines 
sending it logs, but at a lower rate than those two (but still 
significantly higher than the one that fails). if there was a problem with 
load or the number of messages being received simultaneously I would 
expect one of these other three to have more problems than the one that 
fails on me.

3. a noticeable fraction of the logs sent through this relay box are sent 
by a cron job running on each of ~60 machines that wakes up every minute and 
scrapes a local file, sending all the pending messages, so the incoming 
messages are a bit burstier than normal. the relaying to the central box is 
still bursty, but it is only one bursty box, not many

note that even if this cron job is stopped I still had 4.2.1 die on this 
relay box, so I don't think that it's the bursty nature of the traffic

this is why I'm suspicious of the malformed message handling

David Lang
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com