David, my tests (and your responses) made my thinking evolve a bit. I now think the problem is probably related to some corruption of stack variables (not the stack frame itself, but some pointers on it). One reason, among others, is that I don't see any real problems under valgrind, while I still get a violation in some code that looks totally unrelated. The best explanation I can think of is this type of stack-based error, which can *not* be detected by valgrind. Also, a lot of the optimizations I made involved moving (costly) heap memory allocations to (far cheaper) stack memory allocations. So this is another indication that the problem may be rooted in the stack area.
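To illustrate the point (a hypothetical sketch, not rsyslog code): valgrind's memcheck instruments heap allocations, so an overrun of an on-stack buffer that clobbers a neighboring variable typically goes undetected. A manual guard word, similar in spirit to the runtime checks that --enable-debug asserts perform, can catch such corruption close to where it happens:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch -- not rsyslog code. A guard word placed directly
 * after an on-stack buffer lets a runtime check detect overruns that
 * valgrind's heap-oriented memcheck would miss. */
#define GUARD_MAGIC 0xDEADBEEFu

struct guarded_buf {
    char     data[32];
    uint32_t guard;          /* sits immediately after data[] */
};

static void guard_init(struct guarded_buf *g)
{
    g->guard = GUARD_MAGIC;
}

/* returns 1 while the guard word is intact, 0 once it was overwritten */
static int guard_intact(const struct guarded_buf *g)
{
    return g->guard == GUARD_MAGIC;
}
```

In a debug build one would wrap guard_intact() in an assert, so a wild write aborts near the root cause instead of much later in unrelated code.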
There is one thing that would help verify this assumption: it would be great if you could run 4.2.0 (.0 is important!) on the system that experiences the problem. If that runs stably, the problem source is very probably in the optimizations I did. If it still crashes, well, then I need a new theory ;) Is this possible?

As a side note: I think that my UDP message loss may partly be related to DNS resolution. I will test this in a lab tomorrow. But I still think a lot of packets never leave the source system. This may be related to the virtual environment I am currently using for the lab. I hope to be able to generate the traffic with a program, because that offers me the flexibility (now and in the future) to test complex message scenarios (which, granted, does not help if it does not expose the problem...).

Rainer

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of [email protected]
> Sent: Monday, August 31, 2009 5:38 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] abort in 4.2.1
>
> On Mon, 31 Aug 2009, Rainer Gerhards wrote:
>
> > On Fri, 2009-08-28 at 14:55 -0700, [email protected] wrote:
> >> On Fri, 28 Aug 2009, Rainer Gerhards wrote:
> >>> Also, it would be good if you could --enable-rtinst --enable-debug
> >>> and try out that version on your machine. I am a bit concerned about
> >>> the speed of the resulting executable, it may be too slow. You do not
> >>> need to run it in debug mode itself. These options (especially
> >>> --enable-debug) will activate in-depth runtime checks (asserts, which
> >>> will abort when something wrong happens) and my hope is that they
> >>> will catch the bug closer to the root cause. If so, I would need the
> >>> gdb abort info (actually enabling debug output would be an option
> >>> some time later).
> >>>
> >>> Please let me know what would be OK with you.
> >>
> >> I will give this a try.
> >>
> >> I was going to suggest that since we have the message getting
> >> corrupted it may make sense to make a temporary branch that has
> >> multiple message buffers and at various times through the message
> >> processing it makes a copy of the message to the buffer. When the
> >> system crashes I will be able to look at the core and see where the
> >> message is getting corrupted.
> >
> > David, I fear it is even more complicated than that. It looks like not
> > only the message got corrupted but the message object itself. There are
> > already two copies of some of the message elements, and they also look
> > inconsistent - except if we really had a null message, that is, one
> > with no content at all (and generating a message object from a null
> > message, I think, would be a bug in itself - but I am sure there are no
> > such messages in your actual traffic). If you think there could be a
> > real null message, I'd follow that path (will probably do so in any
> > case...).
>
> I know that in some places on my network I am seeing malformed messages
> that look like they are overflowing one packet and so trying to go into
> a second packet (with the result being 20 or so characters being the
> entire contents of the message and showing up as the system name with no
> actual system tag or message following it)
>
> it's possible that there are packets with nothing in them, but I am not
> aware of them.
>
> > I think that what really happens is that some part of the code runs
> > wild, thus invalidating some random part of the main memory. At some
> > times, it hits queue structures (or the message object that is held by
> > them) and if so, we will see the abort you experience. With that
> > scenario, duplicating the message buffer does not really help, because
> > looking at the corrupted message object would not provide any
> > additional information.
>
> ouch
>
> > However, if that's easy enough to reproduce, it would probably be good
> > if you could send me the core analysis (the backtrace and the print
> > statements) from a few (five maybe?) independent aborts. Maybe they
> > show a pattern. It would probably be best to send them via private
> > mail, as I am not sure if they disclose more than they should.
>
> I will see about doing that.
>
> >>
> >> I will see about doing a tcpdump at the time that I do this and send
> >> it to you (I'll need to check with management, but since we have a
> >> contract in place for other reasons I think we can do this)
> >
> > That would probably be a good thing. I've made some progress with my
> > testing tool, and I have created a basic version right now. Probably
> > not good enough to mimic your traffic pattern, but closer. I am doing
> > a test run for quite some time now, unfortunately so far without an
> > abort.
> >
> > Note that I run into trouble with UDP - even though I've put some
> > one-ms sleeps into the code, I lose a lot of messages, as it looks
> > even before they hit the wire. It's always real troublesome to test
> > with UDP...
>
> interesting. I have been able to get very high transmission rates with
> UDP without losing packets.
>
> what I did was to use syslog to generate sample messages, captured them
> with tcpdump, and then used tcpreplay to send them at varying data
> rates.
>
> David Lang
>
> > Rainer
> >> I can't do this late on a Friday, but I should be able to do this
> >> Monday afternoon.
> >>
> >> David Lang
> >> _______________________________________________
> >> rsyslog mailing list
> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >> http://www.rsyslog.com
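P.S. Regarding the UDP test-traffic generator and the one-ms sleeps discussed above, a minimal paced-sender sketch could look like this (hypothetical code, not the actual testing tool; send_paced and its parameters are made up for illustration). The nanosleep between datagrams throttles the sender so the local socket buffer does not overflow, which is one way messages get lost before they ever hit the wire:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Hypothetical sketch of a paced UDP test sender. Sends `count` copies
 * of `msg` to `dst`, sleeping `pace_ns` nanoseconds between datagrams.
 * Returns the number of datagrams handed to the kernel successfully. */
static int send_paced(int sock, const struct sockaddr_in *dst,
                      const char *msg, int count, long pace_ns)
{
    struct timespec pace = { 0, pace_ns };
    int sent = 0;

    for (int i = 0; i < count; i++) {
        if (sendto(sock, msg, strlen(msg), 0,
                   (const struct sockaddr *)dst, sizeof *dst) < 0)
            break;              /* e.g. ENOBUFS when sending too fast */
        sent++;
        nanosleep(&pace, NULL); /* throttle between datagrams */
    }
    return sent;
}
```

Replaying captured traffic with tcpreplay, as David describes, sidesteps this pacing problem entirely; a programmatic sender like the above trades that convenience for the flexibility to construct arbitrary message scenarios.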

