On Fri, 2009-08-28 at 14:55 -0700, [email protected] wrote:
> On Fri, 28 Aug 2009, Rainer Gerhards wrote:
> > Also, it would be good if you could build with --enable-rtinst
> > --enable-debug and try out that version on your machine. I am a bit
> > concerned about the speed of the resulting executable, it may be too slow.
> > You do not need to run it in debug mode itself. These options (especially
> > --enable-debug) will activate in-depth runtime checks (assert, will abort
> > when something goes wrong) and my hope is that they will catch the bug
> > closer to the root cause. If so, I would need the gdb abort info (actually
> > enabling debug output would be an option some time later).
> >
> > Please let me know what would be OK with you.
> 
> I will give this a try.
> 
> I was going to suggest that since we have the message getting corrupted it
> may make sense to make a temporary branch that has multiple message
> buffers and at various times through the message processing it makes a
> copy of the message to the buffer. When the system crashes I will be able
> to look at the core and see where the message is getting corrupted.

David, I fear it is even more complicated than that. It looks like not
only the message got corrupted but the message object itself. There are
already two copies of some of the message elements, and they also look
inconsistent - except, if we really had a null message, that is one with
no content at all (and generating a message object from a null message,
I think, would be a bug in itself - but I am sure there are no such
messages in your actual traffic). If you think there could be a real
null message, I'd follow that path (will probably do so in any case...).

I think that what really happens is that some part of the code runs
wild, thus invalidating some random part of the main memory. Sometimes
it hits queue structures (or the message object that is held by
them), and when it does, we see the abort you experience. With that
scenario, duplicating the message buffer does not really help, because
looking at the corrupted message object would not provide any additional
information.
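
As a rough illustration of why an assert-style runtime check can fail closer to the root cause than a crash somewhere in queue processing, here is a minimal C sketch. The demo_msg_t type, the guard words, and GUARD_MAGIC are hypothetical; this is not rsyslog's actual message object or its debug checks, only the general technique of validating an object before each use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GUARD_MAGIC 0xDEADBEEFu  /* arbitrary marker value (assumption) */

/* Hypothetical message object; NOT rsyslog's real message layout. */
typedef struct {
    uint32_t guard_head;   /* must always equal GUARD_MAGIC */
    char     text[64];     /* message payload, NUL-terminated */
    size_t   len;          /* number of valid bytes in text */
    uint32_t guard_tail;   /* must always equal GUARD_MAGIC */
} demo_msg_t;

/* Initialise the object and arm both guard words. */
void demo_msg_init(demo_msg_t *m, const char *s)
{
    m->guard_head = GUARD_MAGIC;
    m->guard_tail = GUARD_MAGIC;
    m->len = strlen(s);
    if (m->len >= sizeof m->text)
        m->len = sizeof m->text - 1;   /* truncate oversized input */
    memcpy(m->text, s, m->len);
    m->text[m->len] = '\0';
}

/* Return 1 if the object still looks sane, 0 if it was overwritten. */
int demo_msg_is_intact(const demo_msg_t *m)
{
    return m->guard_head == GUARD_MAGIC
        && m->guard_tail == GUARD_MAGIC
        && m->len < sizeof m->text
        && m->text[m->len] == '\0';
}

/* Debug-build check: abort at the point where damage is first noticed. */
#define DEMO_MSG_ASSERT(m) assert(demo_msg_is_intact(m))
```

Placing DEMO_MSG_ASSERT(m) at each processing step would narrow down between which two steps the object was overwritten. It only notices damage to the checked fields, so it would not catch every stray write.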

However, if that's easy enough to reproduce, it would probably be good
if you could send me the core analysis (the backtrace and the print
statements) from a few (five maybe?) independent aborts. Maybe they show
a pattern. It would probably be best to send them via private mail, as I am
not sure if they disclose more than they should.

> 
> I will see about doing a tcpdump at the time that I do this and send it to 
> you (I'll need to check with management, but since we have a contract in 
> place for other reasons I think we can do this)
> 

That would probably be a good thing. I've made some progress with my
testing tool, and I now have a basic version. It is probably not good
enough to mimic your traffic pattern, but closer. I have been doing a
test run for quite some time now, unfortunately so far without an abort.

Note that I ran into trouble with UDP - even though I've put some
one-ms sleeps into the code, I lose a lot of messages, as it looks even
before they hit the wire. It's always really troublesome to test with
UDP...
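
For readers who want to try something similar, a paced UDP sender might look roughly like the C sketch below. It is not the testing tool mentioned above; udp_send_paced, udp_stats_t, and the "<14>test message" payload are made up for illustration. It sleeps between datagrams and counts the sends the kernel rejects. Datagrams that are silently dropped after sendto() succeeds are not visible to it, so the sequence number in each message is there for the receiver to spot gaps.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() under strict C modes */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Result counters for one test run. */
typedef struct {
    int sent;    /* sendto() accepted the whole datagram */
    int failed;  /* sendto() reported an error or a short write */
} udp_stats_t;

/* Send `count` numbered test messages to ip:port, sleeping pause_ms
 * milliseconds after each datagram.
 * Returns 0 on success, -1 if the address or socket setup failed. */
int udp_send_paced(const char *ip, unsigned short port, int count,
                   long pause_ms, udp_stats_t *st)
{
    struct sockaddr_in dst;
    struct timespec pause;
    char buf[128];
    int fd, i;

    assert(st != NULL);
    st->sent = st->failed = 0;

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    pause.tv_sec = pause_ms / 1000;
    pause.tv_nsec = (pause_ms % 1000) * 1000000L;

    for (i = 0; i < count; i++) {
        /* The sequence number lets the receiver detect missing messages. */
        int n = snprintf(buf, sizeof buf, "<14>test message %d", i);
        if (sendto(fd, buf, (size_t)n, 0,
                   (struct sockaddr *)&dst, sizeof dst) == n)
            st->sent++;
        else
            st->failed++;
        nanosleep(&pause, NULL);
    }
    close(fd);
    return 0;
}
```

A call such as udp_send_paced("127.0.0.1", 514, 1000, 1, &st) would send 1000 messages with a one-ms pause each. Comparing st.sent with the number of messages the receiver logged gives a rough idea of how many were lost after leaving the application.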

Rainer
> I can't do this late on a Friday, but I should be able to do this Monday
> afternoon.
> 
> David Lang
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com

