Aaron, David,

thanks again for the feedback. I did some testing over the past few hours, but
all of it on v5, so the numbers may be quite different for v4.

Based on what I have seen, the issue seems not to be lock contention, but
rather that quite long code sequences are being locked. In v5, we introduced
batching, but we also (somewhat later) introduced transactions. A transaction,
however, is atomic, so rsyslog must make sure that no two transactions are
mixed together. To do so, the engine acquires the action lock when the batch
begins processing and releases it only when the batch is completely processed.
During that processing, not only is the action called (which would be fine),
but the to-be-passed strings are also generated. If there is a quick filter
"in front of the action" (as *.* is!), this string generation is the majority
of the work being carried out. So in essence, due to transaction support,
there cannot be more than one worker thread inside that large code sequence. I
have tested on a quad-core machine, and I can't get more than 1.3 CPUs
utilized when I have a single input and a single rule:

*.* /path/to/file

I then *completely* removed omfile processing (by commenting out everything in
doAction), and I still can't get more than 1.5 CPUs utilized, all of this with
4 worker threads defined (the imtcp input uses around 10 to 15% of a single
CPU).

So this seems to be the culprit, at least as far as v5 is concerned. For v4, I
am pretty sure I would get a totally different picture, because there is no
such coarse locking and the string generation can run in parallel (but there
we have lots of locking activity and contention). V5 seems to be so much
faster that this effect did not really surface before.

So what is the cure? An easy way would be to generate the strings for the
entire batch before beginning to process it. However, there is a reason this
has not been done so far: especially with large batches, it would require a
large amount of memory. For example, let's say we have a batch of 1,000
messages and each string is 200 bytes long. We would then need 200,000 bytes,
or roughly 200k, to hold just the strings, where we currently use only 200
bytes (but repeatedly overwrite them). That amount probably has a big impact
on the cache hit ratio: at worst, the first messages' strings will be evicted
from cache by the later ones. And as a batch is processed from beginning to
end, each message will evict some strings that we will soon need. The end
result could be really bad. For smaller batch sizes, of course, this would not
be as much of a problem.

The other (partial) solution that immediately came to my mind was to check /
make configurable whether we really *need* (or should use) transactions. If
not, there is no problem in interleaving messages, so we would not need to
make sure the batch is processed atomically, and thus we could release the
lock while generating the strings -- much like v4 does. HOWEVER, that also
means much more locking activity, and so lock contention begins to become a
concern again (somewhat going in circles here).

Maybe it would be good to have both capabilities and let the operator decide
which one suits best. For starters, I will probably just try to change the
string generation phase to use a big buffer and move it outside of the atomic
transaction code. While I have some concerns, as outlined above, I find it
interesting to see the effect this has on overall performance (when used with
reasonable batch sizes). This may serve as a guideline for further work.

As usual, comments are greatly appreciated.

Rainer

> -----Original Message-----
> From: [email protected] [mailto:rsyslog-
> [email protected]] On Behalf Of Aaron Wiebe
> Sent: Monday, June 07, 2010 7:17 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] discussion request: performance enhancement for
> imtcp
> 
> Greetings,
> 
> >> hmm, this could be locking overhead as well. One thing that you did
> >> early
> >> in v5 (I don't think it made it into v4) was to allow the UDP
> receiver
> >> to
> >> insert multiple messages into the queue at once. That made a huge
> >> difference.
> >
> > No, I think that was something I did to both versions. At some time,
> I did
> > optimizations to both v4 and v5, things like reducing copies,
> reducing malloc
> > calls and so on. I am pretty sure submission batching was among them.
> 
> I agree with David actually.  While multiple tcp threads on the input
> side certainly would be helpful, I believe the locking overhead is
> likely the real culprit behind the inability to fully utilize a
> multi-core machine with a single instance of rsyslog.  In my
> experience, while the input thread was certainly relatively busy, the
> thread itself wasn't hitting a cpu bottleneck.  Reducing some of the
> latencies around queuing and context switching is probably the best
> place to spend time if the goal is improved performance.  The earlier
> investigations into lockless queues combined with some batching may
> help to address these.  As it stands, I don't regularly see specific
> threads hitting cpu bottlenecks (assuming top -H is accurate).
> 
> Also, if that is the problem (queues and context switching), adding
> further division of work into imtcp may actually make the problem
> worse.  That said, I'm not against reducing possible bottlenecks to
> get into the 1-10 gig input levels (at which this would probably
> become an issue) - but I think the queues should be more closely
> examined first.
> 
> -Aaron
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com