Aaron, David, thanks again for the feedback. I did some testing over the past few hours, but all of it on v5, so the numbers may be quite different for v4.
Based on what I have seen, the issue seems not to be lock contention, but rather that quite long code sequences are being locked. In v5 we introduced batching, but we also (somewhat later) introduced transactions. A transaction, however, is atomic, so rsyslog must make sure that no two transactions are mixed together. To do so, the engine acquires the action lock when batch processing begins and releases it only when the batch is completely processed. During that processing, not only is the action called (which would be fine), but the to-be-passed strings are also generated. If there is a quick filter "in front of the action" (as *.* is!), this string generation is the majority of the work being carried out. So in essence, due to transaction support, there cannot be more than one worker thread inside that large code sequence.

I have tested on a quad-core machine, and I can't get more than 1.3 CPUs utilized when I have a single input and a single rule *.* /path/to/file. I have then *completely* removed omfile processing (by commenting out everything in doAction), and I still can't get more than 1.5 CPUs utilized, all of this with 4 worker threads defined (the imtcp input uses around 10 to 15% of a single CPU). So this seems to be the culprit, at least as far as v5 is concerned. For v4, I am pretty sure I will get a totally different picture, because there is no such coarse locking and the string generation can run in parallel (but there we have lots of locking activity and contention). V5 seems to be so much faster that this effect did not really surface before.

So what is the cure? An easy way would be to generate the strings for the entire batch before beginning to process it. However, there is a reason this has not been done so far: especially with large batches, it would require a large amount of memory. For example, let's say we have a batch of 1,000 messages and each string is 200 bytes long.
So we would need 200,000 bytes, or roughly 200k, to hold just the strings, where we currently use only 200 bytes (but repeatedly overwrite them). That amount probably has a big impact on the cache hit ratio: at worst, the first messages' strings will be evicted from cache by the later ones, and as a batch is processed from beginning to end, each message will evict some strings that we will soon need. The end result could be really bad. For smaller batch sizes, of course, that would not be as much of a problem.

The other (partial) solution that immediately came to mind was to check (or be able to configure) whether we really *need* (or should use) transactions. If not, there is no problem in interleaving messages, so we would not need to make sure the batch is processed atomically and could release the lock while generating the strings -- much like v4 does. HOWEVER, that also means much more locking activity, and so lock contention begins to become a concern again (somewhat going in circles here). Maybe it would be good to have both capabilities and let the operator decide which one suits best.

For starters, I will probably just try to change the string generation phase to use a big buffer and move it outside of the atomic transaction code. While I have quite some concerns, as outlined above, it will be interesting to see the effect this has on overall performance (when used with reasonable batch sizes). This may be useful as a guideline for further work.

As usual, comments are greatly appreciated.

Rainer

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Aaron Wiebe
> Sent: Monday, June 07, 2010 7:17 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] discussion request: performance enhancement for imtcp
>
> Greetings,
>
> >> hmm, this could be locking overhead as well.
> >> One thing that you did early in v5 (I don't think it made it into
> >> v4) was to allow the UDP receiver to insert multiple messages into
> >> the queue at once. That made a huge difference.
> >
> > No, I think that was something I did to both versions. At some time,
> > I did optimizations to both v4 and v5, things like reducing copies,
> > reducing malloc calls and so on. I am pretty sure submission
> > batching was among them.
>
> I agree with David actually. While multiple tcp threads on the input
> side certainly would be helpful, I believe the locking overhead is
> likely the real culprit behind the inability to fully utilize a
> multi-core machine with a single instance of rsyslog. In my
> experience, while the input thread was certainly relatively busy, the
> thread itself wasn't hitting a cpu bottleneck. Reducing some of the
> latencies around queuing and context switching is probably the best
> place to spend time if the goal is improved performance. The earlier
> investigations into lockless queues combined with some batching may
> help to address these. As it stands, I don't regularly see specific
> threads hitting cpu bottlenecks (assuming top -H is accurate).
>
> Also, if that is the problem (queues and context switching), adding
> further division of work into imtcp may actually make the problem
> worse. That said, I'm not against reducing possible bottlenecks to
> get into the 1-10 gig input levels (at which this would probably
> become an issue) - but I think the queues should be more closely
> examined first.
>
> -Aaron
> _______________________________________________
> rsyslog mailing list
> http://lists.adiscon.net/mailman/listinfo/rsyslog
> http://www.rsyslog.com

