> -----Original Message-----
> From: [email protected] [mailto:rsyslog-
> [email protected]] On Behalf Of Dražen Kacar
> Sent: Wednesday, February 16, 2011 6:38 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] Race conditions and crashes
>
> Rainer Gerhards wrote:
>
> > > My test was receiving 4 lines via UDP and writing them to a file and a
> > > FIFO. It was as simple as I could make it. Thread scheduling class was
> > > not set.
> >
> > did you experience any problems without the analyzer in this setting? As
> > I said, I am searching for this bug, but so far we are unable to
> > reproduce it (I even got some help from Florian, but so far to no
> > avail...).
>
> No, I can't reproduce the problem at will. I only see it on production
> servers under heavy load, but I can't run tools which slow down processing
> there.
>
> OTOH, a while ago I reported a perfectly reproducible crash with
> omruleset. I'll try that one under the thread analyzer.
I'll probably try that out next. But I tend to stick a bit to the current
scenario -- will see...

> And I could take a huge chunk of production data and feed it to the test
> system to see what happens.
>
> > > race #1 - module loading
> > this is known and really no issue
>
> Yes.
>
> > > race #2 - shutdown all workers
> > > race #3 - thread destructor (this one might be responsible for
> > > something)
> > I think they are OK as well, but I will check. Maybe just atomic
> > emulation is missing. May also be that this is a case where it really
> > doesn't matter if dual reads are necessary.
>
> I don't know when the thread destructor is called. If only on rsyslog
> shutdown, then it doesn't really matter. If it can be called during normal
> processing, then maybe that's causing the problem, but I don't think
> that's the one.

It can also be called during normal operation, when the workload drops and
workers sit idle for a while. But all it does is set a flag to indicate that
the worker should shut down (or, "the other way round", that it has shut
down), so if the update is seen late, the only consequence is one more
iteration of the shutdown loop (a few ms delay).

> > > race #4 - thread termination on SIGTTIN
> > sounds interesting, will check.
>
> I think I just copied that SIGTTIN from the comment in the code. I didn't
> send any signals except SIGTERM to terminate the process. I don't know if
> the thread analyzer inserts its own signal handlers.

SIGTTIN is used by rsyslog internally to indicate that a thread shall
terminate. The problem is that we must wake up blocking API calls, and
SIGTTIN does this.

> > And I think my initial answer was only partly correct. I assumed the
> > tool was something like the clang static analyzer.
>
> It's not static. Which is nice, because it can find things that static
> analyzers can't, but the not-so-nice part is that you have to figure out a
> way to exercise all interesting code paths.
>
> One significant difference between production and the very light lab load
> is that production uses disk queues. They were configured for the lab test
> as well, but the amount of data was small and nothing was ever written to
> disk.

Ah, that's an interesting hint! Disk queue mode is quite different, and
disk-assisted queues are very complex beasts. Have you tried pure disk mode?
(I don't think it'll cause the issue to appear, but maybe -- I was not yet
on this route.)

> Maybe the problem is in the disk writing code. And maybe not. :-(
>
> I got several more problem reports from the tool. One of them is a real
> dirty-read bug, but the data being read is then fed to the dbgprint
> statement, so that's not causing any problem. In the worst case you'd just
> get wrong debug output printed.

Also known. The reasoning is that the more you properly synchronize inside
the debug code, the more you change the runtime characteristics of the code
under test... So in many places inside the debug system, races (with
cosmetic results) are deliberately accepted. It will get even worse in newer
builds, where I was able to identify one problem by totally disabling all
sync code (so far hardcoded, but it will become an optional debug switch).
The joys of multithreading ;)

> And there are a few which look very odd. Like this one:
>
> Race #7, Vaddr: 0x41179db4
>   Access 1: Write, __pthread_unwind + 0x0000003B
>   Access 2: Read, batchProcessed + 0x000000C8,
>             line 1660 in "queue.c"
>   Total Callstack Traces: 1
>   Callstack Trace 1
>     Access 1: Write
>       __pthread_unwind + 0x0000003B
>       collector_root + 0x0000004C
>       start_thread + 0x000000D5
>     Access 2: Read
>       batchProcessed + 0x000000C8, line 1660 in "queue.c"
>       wtiWorker + 0x000003A9, line 304 in "wti.c"
>       wtpWorker + 0x000002A5, line 387 in "wtp.c"
>       collector_root + 0x0000004C
>       start_thread + 0x000000D5
>
> That's from the pthread_setcancelstate() call. I just don't want to go in
> that direction.
:-) Was that during shutdown? I don't think there is any other reason
pthread_cancel should be called in the recent v5 codebase...

Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

