On Thu, 15 Jan 2009, Rainer Gerhards wrote:

RG> On Thu, 2009-01-15 at 18:58 +0100, Lorenzo M. Catucci wrote:
RG> > I've just tried rsyslog again on my 8-core mail server, and got the
RG> > very same crash as in September/October.
RG> 
RG> So, without valgrind, can you reproduce the issue each time you start
RG> it? That would be very useful.
RG> 

Yes: every time I start a free-running instance, I get the very same
segmentation fault, plus a core file to backtrace.

RG> 
RG> > I've restarted the server under 
RG> > valgrind control, and all seems to be running well...
RG> 
RG> I guess the issue here is that valgrind slows down things and also
RG> simulates (I think) 2 CPUs only.
RG> 

Right. I didn't know valgrind limited both the CPU bandwidth and the
(v)CPU count, but either of them would hide the existing race condition.

RG> 
RG> From what I have learned so far we seem to have a race condition that
RG> causes memory corruption. The backtrace you included also points in
RG> that direction. Those few cases where I got a usable backtrace all
RG> point to the very same location. However, that does not mean this
RG> location has the bug. It seems to occur some time earlier, and
RG> manifests when the message is destructed. It could be a double-free
RG> or even some wild memory access that accidentally overwrites some
RG> structures.
RG> 
RG> If we are able to get a stable repro, and we are able to run with at
RG> least some minimal diagnostics, we may be much better off tackling
RG> that beast.
RG> 
RG> First step is to see that we get a stable repro. If we do, I need to
RG> think about minimal debug. The full debugging system makes the bug
RG> disappear, I think because it changes the timing.
RG> 

I don't think we can hope for a stable reproducer for a heisenbug...
all I can provide is a very high throughput system generating a very high
local message rate. As a matter of fact, this rsyslog instance is
acting as a forwarder to a remote instance that hasn't suffered any crash.

The only differences between the two instances' configurations are:
  1. the remote logs to a PostgreSQL instance instead of spool files,
  2. the remote runs only the PostgreSQL instance and the logger.

My gut feeling is that the different behaviour doesn't come from either
of these differences, but from the different memory path taken by the
messages, which in the remote case are deserialised from the underlying
network transport.
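
For illustration only, here is a minimal sketch of the kind of race
discussed above: several threads dropping their reference to one shared
message, where an unsynchronised counter could lead to a double-free when
the message is destructed. This is NOT rsyslog's code; demo_msg_t,
demo_msg_release and demo_run are hypothetical names, and the sketch
assumes a C11 compiler with <stdatomic.h> and POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical message object; not rsyslog's real message type. */
typedef struct {
    atomic_int refs;        /* number of live references */
    atomic_int *destroyed;  /* counts destructor runs (demo only) */
} demo_msg_t;

/*
 * Drop one reference; only the thread that removes the last one frees.
 * With a plain int and "if (--m->refs == 0) free(m);" the decrement is a
 * data race: updates can be lost or observed inconsistently, so the
 * message may be freed twice or never, which only shows up under load.
 */
static void demo_msg_release(demo_msg_t *m)
{
    /* atomic_fetch_sub returns the value held before the decrement. */
    if (atomic_fetch_sub(&m->refs, 1) == 1) {
        atomic_fetch_add(m->destroyed, 1);
        free(m);
    }
}

static void *demo_worker(void *arg)
{
    demo_msg_release((demo_msg_t *)arg);
    return NULL;
}

/* Share one message among nthreads threads and return how many times
 * its destructor ran (1 if the reference counting is correct). */
static int demo_run(int nthreads)
{
    assert(nthreads > 0);

    atomic_int destroyed = 0;
    pthread_t *tids = malloc(sizeof(*tids) * (size_t)nthreads);
    demo_msg_t *m = malloc(sizeof(*m));
    if (tids == NULL || m == NULL) {
        free(tids);
        free(m);
        return -1;
    }
    atomic_init(&m->refs, nthreads);
    m->destroyed = &destroyed;

    int started = 0;
    while (started < nthreads &&
           pthread_create(&tids[started], NULL, demo_worker, m) == 0)
        started++;

    /* Drop the references of any threads that could not be started. */
    for (int i = started; i < nthreads; i++)
        demo_msg_release(m);

    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return atomic_load(&destroyed);
}
```

Calling demo_run(8) on an 8-core box should report exactly one
destruction. Swapping the atomic counter for a plain int would not fail
reliably either, which is rather the point: such a bug depends on timing.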

We'll see! Yours,

        lorenzo



+-------------------------+----------------------------------------------+
|   Lorenzo M.  Catucci   | Centro di Calcolo e Documentazione           |
| [email protected] | Università degli Studi di Roma "Tor Vergata" |
|                         | Via O. Raimondo 18 ** I-00173 ROMA  ** ITALY |
|  Tel. +39 06 7259 2255  | Fax. +39 06 7259 2125                        |
+-------------------------+----------------------------------------------+
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com
