I could go even further. New commit: http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=11bd517465360278b270ee7c18607b4d1d97e44e
Mark messages are now correctly handled again, and performance is equally good in that case because we do not need to do a full serialization. For those interested, I'd like to draw your attention to the lock-free CAS loop at line 1356 of action.c: http://git.adiscon.com/?p=rsyslog.git;a=blob;f=action.c;h=b8751c636038e344f4f0c479be2e85cfce8ba6ff;hb=11bd517465360278b270ee7c18607b4d1d97e44e#l1344 It handles time updates in a lock-free manner. My educated guess is that we can replace a couple of mutex calls with methods similar to this CAS loop and thereby increase performance and decrease complexity. If you have not been in touch with lock-free methods: this is more or less speculative execution, where the computation result is discarded if some other thread was faster persisting its result. The computation is then retried. While this sounds a bit like a waste of CPU time, it actually is more efficient, as we do not have all the overhead of locking methods (think context switches et al). There is good scientific literature backing this approach as being correct and efficient. Lock-free methods are often used inside operating systems themselves, including very demanding real-time OSes. Of course, there are restrictions on what can be done with them, but they often provide an excellent alternative to mutexes and other blocking synchronization mechanisms. Rainer > -----Original Message----- > From: [email protected] [mailto:rsyslog- > [email protected]] On Behalf Of Rainer Gerhards > Sent: Tuesday, June 08, 2010 1:45 PM > To: rsyslog-users > Subject: Re: [rsyslog] discussion request: performance enhancement for > imtcp > > The good results seem to hold true in practice as well. 
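To make the idea concrete, here is a minimal sketch of such a CAS loop in C. This is illustrative code only, not the actual action.c implementation: the variable and function names are invented, and it uses the GCC `__sync` compare-and-swap builtin.

```c
/* Illustrative sketch of a lock-free time update via a CAS loop.
 * NOT the actual action.c code; all names are made up.
 * Uses the GCC __sync compare-and-swap builtin. */

static volatile long lastExecTime = 0; /* shared "last executed" stamp */

/* Speculatively advance lastExecTime to 'now'. If another thread stored
 * a value in the meantime, the CAS fails and we retry with the freshly
 * read value: no mutex, no context switch. */
static void
updateLastExecTime(long now)
{
    long seen;
    do {
        seen = lastExecTime;
        if (now <= seen)
            return; /* another thread already stored a newer time */
    } while (!__sync_bool_compare_and_swap(&lastExecTime, seen, now));
}
```

The loop only retries when another thread actually won the race, which for a timestamp update is rare and far cheaper than taking a mutex on every call.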
I did a first > somewhat solid implementation; the commit can be found here: > > http://git.adiscon.com/?p=rsyslog.git;a=commitdiff;h=3e49a1075ab6750135e1a38cf0c213579fa30b4a > > If no special params are selected, firehose mode is used for stage 1 > processing (up to enqueue in action queue), providing full concurrency. > One > problem so far is the setting to emit mark messages only when no > messages > have recently been written. Unfortunately, that mode is the default. It > enforces some serialization (keeping the last written timestamp for the action) > and thus > cannot be used in full firehose mode. For now, I have simply ignored > that > setting. In the somewhat longer term, I will probably end up with three > or > more slightly different algorithms, each dedicated to a specific set of > parameters. The "mark issue", for example, I think can be relatively > easily be > solved by lock-free synchronization via a CAS loop. I have put > considerable > effort into researching this since March, and I hope that work now pays > back. > > The test suite also tells me that something is not 100% correct; > at > least omruleset has a problem with the new code (the test segfaults). > So > there is more work to do, but I am confident these are comparatively > minor > issues. > > Looking at the performance, on my "ad-hoc performance lab" (read: > virtualized > development environment) I get around ~ 70K mps both for the old and > the new > code (with the new one being slightly faster) when I run them > on one > thread. This is a quad-core system, and three cores are almost idle. So > when > I select three main message queue workers, the old code goes *down* to > around > 40k mps -- this is the lock contention we have so often seen and > obviously > not yet solved. But the new code goes up to 120k mps, and three CPUs > get > utilized around 85%. 
So obviously, sync is still eating up more CPU > than it > should, but we get a considerable speedup of 1.7 (though low compared to > almost > three times as many CPUs). Anyhow, this is the work of maybe 10 hours > of > actual coding and analysis, so it doesn't look too bad ;) Also, it > tells me that > I am probably on the right track -- and my previous investment in > research > pays back :). > > If someone wants to test out the new version, please use the commit > above, I > am not sure what future commits will break. I'll tell you when I think > there > is another version suitable for testing. Please report back your > findings on > performance, I'd be very interested to hear about them. Keep in mind > that you > need to run the main message queue with more than one worker in order > to see > any difference. > > Rainer > > > -----Original Message----- > > From: [email protected] [mailto:rsyslog- > > [email protected]] On Behalf Of Rainer Gerhards > > Sent: Tuesday, June 08, 2010 12:04 PM > > To: rsyslog-users > > Subject: Re: [rsyslog] discussion request: performance enhancement > for > > imtcp > > > > OK, *very* interesting: I have now partially analyzed control flow as > > it is > > after all the changes to v4 and v5. Over the years, more and more > > features > > went into rsyslog, most of them used only in a minority of cases. > > Plus, > > there is "last message repeated n times", which I have discussed quite > > often > > so far (and which I really dislike). > > > > I have now checked that most of these features require serialization > > inside > > the first stage of action processing. For example, rate limiting > > features > > like "do not execute this action more than 10 times within 5 seconds" > > require > > a proper count of messages, and so do others. The end result is that > > this > > stage needs to be guarded by a lock, and it is. So during that > > processing, no > > concurrency is possible. Also, the code path has become rather long > and > > complex. 
> > > > Now, these features are there for a good reason. It doesn't make sense > > to > > remove them just to get more performance. However, I now plan to > > partition > > the code, and serialize things only when selected features actually > > demand > > that. I have created a version that hardcodes a "firehose" mode where > > none of > > these esoteric features are used (and "last message ..." is also > > disabled). It immediately more than doubled the throughput, without > > even a > > change to the action lock! I am not sure if this is a realistic > speedup > > (as I > > created a very specific environment), but it clearly shows there is > > value in > > that approach. > > > > So I will now see if I can create some code that enables a "firehose > > mode" if > > action parameters permit that (and the default action parameters, I > > think, > > should do so). > > > > Rainer > > > > > -----Original Message----- > > > From: [email protected] [mailto:rsyslog- > > > [email protected]] On Behalf Of Rainer Gerhards > > > Sent: Tuesday, June 08, 2010 9:16 AM > > > To: [email protected]; rsyslog-users > > > Subject: Re: [rsyslog] discussion request: performance enhancement > > for > > > imtcp > > > > > > > -----Original Message----- > > > > From: [email protected] [mailto:[email protected]] > > > > Sent: Monday, June 07, 2010 11:49 PM > > > > To: rsyslog-users > > > > Cc: Rainer Gerhards > > > > Subject: Re: [rsyslog] discussion request: performance > enhancement > > > for > > > > imtcp > > > > > > > > First off a comment that this may not make it through to the > list. > > > I'm > > > > having to use webmail to access my home e-mail from work and I > got > > > > bounced > > > > earlier today. > > > > > > > > On Mon, 7 Jun 2010 22:11:37 +0200, "Rainer Gerhards" > > > > <[email protected]> wrote: > > > > > Aaron, David, > > > > > > > > > > thanks again for the feedback. 
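The partitioning decision described above can be pictured as a simple config-time predicate. This is a hypothetical sketch; the parameter names are invented for illustration, not rsyslog's actual configuration flags:

```c
/* Hypothetical sketch: decide at configuration time whether the fast,
 * fully concurrent "firehose" path may be used for an action. Any
 * feature that requires per-action serialization disables it.
 * All names are illustrative, not rsyslog's real parameters. */

typedef struct actionParams {
    int bRateLimiting;    /* "no more than N execs in T seconds"   */
    int bLastMsgRepeated; /* "last message repeated n times"       */
    int bMarkOnlyIfIdle;  /* emit mark messages only when idle     */
} actionParams_t;

static int
canUseFirehoseMode(const actionParams_t *p)
{
    /* firehose only if no serializing feature is active */
    return !p->bRateLimiting
        && !p->bLastMsgRepeated
        && !p->bMarkOnlyIfIdle;
}
```

The point of such a predicate is that the check happens once, at config time, so the fast path itself stays free of per-message feature tests.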
I did some testing the past few > > > hours, > > > > but > > > > > all > > > > > of them on v5, so the numbers may be quite different for v4. > > > > > > > > > > Based on what I have seen, the issue seems not to be lock > > > contention, > > > > but > > > > > rather locks being held over quite long code sequences. In v5, we introduced > > > > batching, > > > > > but > > > > > we also (somewhat later) introduced transactions. A transaction, > > > > however, > > > > > is > > > > > atomic, and so rsyslog must make sure that no two transactions > are > > > > mixed > > > > > together. To do so, the engine applies the action lock when the > > > batch > > > > is > > > > > being processed and releases it when the batch is completely > > > > processed. > > > > > During that processing, not only the action is being called > (which > > > > would > > > > be > > > > > fine), but also the to-be-passed strings are generated. However, > if > > > > there > > > > > is > > > > > a quick filter "in front of the action" (as *.* is!) this is the > > > > majority > > > > > of > > > > > work that is being carried out. So in essence, due to transaction > > > > support, > > > > > there cannot be more than one worker thread in that large code > > > > sequence. I > > > > > have tested on a quad core machine, and I can't get any more than > > > 1.3 > > > > CPUs > > > > > utilized when I have a single input and a single rule > > > > > > > > > > *.* /path/to/file > > > > > > > > > > I have then *completely* removed omfile processing (by > commenting > > > out > > > > > everything in doAction), and I still can't get up to more than > > 1.5 > > > > CPUs > > > > > being > > > > > utilized, all of this with 4 worker threads defined (the imtcp > > input > > > > uses > > > > > around 10 to 15% of a single CPU). 
> > > > > > > > could this be worked around by using multiple rulesets so that > the > > > main > > > > queue is not locked as long (or does this just result in the > > > secondary > > > > queues getting blocked?) > > > > > > the latter is the case: queue processing/output is not fast enough > > > > > > Let me differentiate output, that is the action, from (queue) > > > processing > > > here. This is important, because when I hear output, I always > > constrain > > > my > > > thinking to the actual output part minus other internal processing > > like > > > filters and the like. A core question is whether the interim processing > or > > > the > > > actual output takes most of the time. > > > > > > > I thought that the transaction support worked like this (which > > would > > > > not > > > > cause the problem you are describing) > > > > > > > > lock queue > > > > mark N messages as being worked on > > > > unlock queue > > > > process messages > > > > lock queue > > > > if successful > > > > mark the messages tagged above as being completed (and therefore > > > > available to be removed, which may remove them) > > > > else > > > > mark the messages tagged above as not processed > > > > unlock queue > > > > > > > > you don't need to hold the lock while processing the records, > only > > > when > > > > tagging them (to make sure another thread doesn't tag them as > well) > > > > > > From the POV of the queue this is correct and how things are > done... 
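The tagging pattern in the pseudocode above can be sketched as real C. This is a toy illustration with invented names, not rsyslog's actual queue code; the key point is that the queue mutex is held only while marking messages, never while processing them:

```c
/* Toy sketch of the batch tagging pattern: lock, tag, unlock, process,
 * lock, commit or roll back, unlock. Names are illustrative. */
#include <pthread.h>

#define QSIZE 8

enum slotState { SLOT_FREE = 0, SLOT_QUEUED, SLOT_INFLIGHT };

struct toyQueue {
    pthread_mutex_t mut;
    int state[QSIZE];
};

/* Tag up to 'want' queued slots as in-flight; returns how many. */
static int
dequeueBatch(struct toyQueue *q, int *idx, int want)
{
    int got = 0;
    pthread_mutex_lock(&q->mut);
    for (int i = 0; i < QSIZE && got < want; ++i)
        if (q->state[i] == SLOT_QUEUED) {
            q->state[i] = SLOT_INFLIGHT;
            idx[got++] = i;
        }
    pthread_mutex_unlock(&q->mut);
    /* caller now processes the batch WITHOUT holding the lock */
    return got;
}

/* After processing: free the slots on success, re-queue on failure. */
static void
commitBatch(struct toyQueue *q, const int *idx, int n, int ok)
{
    pthread_mutex_lock(&q->mut);
    for (int i = 0; i < n; ++i)
        q->state[idx[i]] = ok ? SLOT_FREE : SLOT_QUEUED;
    pthread_mutex_unlock(&q->mut);
}
```

Because another worker cannot tag an in-flight slot, no two threads ever process the same message, yet the expensive processing step runs fully in parallel.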
> > > > > > > > from your description of the process it sounds like what is > > happening > > > > is > > > > > > > > lock queue > > > > mark N messages as being worked on > > > > create output strings for N messages > > > > unlock queue > > > > send messages > > > > lock queue > > > > if successful > > > > mark the messages tagged above as being completed (and therefore > > > > available to be removed, which may remove them) > > > > else > > > > mark the messages tagged above as not processed > > > > unlock queue > > > > > > > > the one thing you need to watch out for here is that you don't move > > > > messages around while they are live (which will cause a little bit > of > > > > fragmentation) > > > > > > This is not correct: you are missing the action part. > > > > > > Remember that the output plugin interface specifies that at most one > > > thread may > > > be inside an action at any time. This was introduced > to > > > facilitate writing output plugins. In theory, an output plugin can > > > request to > > > be called concurrently, but this is not yet implemented. So we need > > to > > > hold > > > on to the action lock (NOT the queue lock) whenever we call an action. > > > > > > Even more, transactions mean that we must not interleave two or > more > > > batches. > > > Let's say we had two batches A and B, each with 4 messages. Then > > > calling the > > > output as follows: > > > > > > Abegin > > > Bbegin > > > A1 > > > A2 > > > B1 > > > A3 > > > B2 > > > A4 > > > Acommit > > > B3 > > > B4 > > > Bcommit > > > > > > would mean that at Acommit, messages A1,..,A4,B1,B2 would be > > committed. > > > This > > > could be worked around by far more complex output plugins. These > would > > > then > > > need to not only support concurrency but also keep separate > > > objects/connections for the various threads. This makes sense, if at all, > > > only > > > for database plugins. 
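The invariant described above (hold the action lock across the entire begin..commit sequence so the messages of batches A and B can never interleave) can be sketched like this. All names are made up for illustration; the record() calls stand in for beginTransaction(), doAction(), and commitTransaction():

```c
/* Illustrative sketch: one action lock serializes whole transactions,
 * so two batches can never interleave at the output plugin.
 * NOT the real rsyslog output-plugin interface; names are invented. */
#include <pthread.h>

struct toyAction {
    pthread_mutex_t lock;
    char trace[64]; /* records the call order, for demonstration */
    int pos;
};

static void
record(struct toyAction *a, char c)
{
    a->trace[a->pos++] = c;
}

/* Process one batch atomically under the action lock. */
static void
processBatch(struct toyAction *a, char batchId, int nMsgs)
{
    pthread_mutex_lock(&a->lock);
    record(a, batchId);             /* stands for beginTransaction()  */
    for (int i = 1; i <= nMsgs; ++i)
        record(a, (char)('0' + i)); /* stands for doAction(msg[i])    */
    record(a, 'c');                 /* stands for commitTransaction() */
    pthread_mutex_unlock(&a->lock);
}
```

With the lock spanning begin..commit, a trace like "A12cB12c" is the only possible ordering; dropping the lock between messages would admit the broken "Abegin Bbegin A1 B1 ..." interleaving from the example above.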
I don't see how the added overhead would > > > make any > > > sense at all for things like the file writer. > > > > > > But as we have already discussed, it is not so easy to keep the > file > > > writer > > > problem free in that case as well -- because it may get interrupted > > > during > > > writes (which means we need a lock, even if we manage to permit > more > > > concurrency inside the file writer). > > > > > > So in essence, the area to look at is whether we can restructure the > > > output > > > plugin interface in regard to its transaction support. I am > currently > > > looking > > > at this area and have done some preliminary testing. My main > concern > > at > > > this > > > time is to find those spots that actually are the primary > bottlenecks > > > (at > > > this time, hopefully moving the border forward ;)). The past hours I > > > thankfully was able to get some base results and match them with > what > > I > > > expect. At some other places, the results surprise me a bit. This > is > > > not > > > unexpected -- I had no time to touch that code (under a performance > > > point of > > > view) for roughly a year, so I need to gain some new understanding. > > > Also, the > > > code has evolved, and it may be possible to refactor it into > > something > > > simpler (which is good for both performance and maintainability). > > > > > > As one of the next things, I will probably use the "big memory, off > > > sync" > > > string generation, just to see the effects (it is rather > complicated > > to > > > get > > > that in cleanly, because there was so much optimization in v4 on > > cache > > > hit > > > efficiency, parts of which must be undone). Along that way, I will > > also > > > analyze the calling structure and search for simplifications. > > > > > > And, as usual, your feedback is very helpful and appreciated. 
Good > > > questions > > > often lead to good thinking and analysis and thus lead to things I > > had > > > not > > > thought about without them ;) > > > > > > As a side-note, I already identified one regression that caused > locks > > > to be > > > applied too often when messages were discarded. I already committed > a > > > fix for > > > that to the master branch. > > > > > > > > If this is a problem you can create a separate lock for modifying > > > > existing > > > > messages > > > > > > > > the locking rules would be something like > > > > > > > > Allow an unlimited number of worker threads to hold it as a > 'read' > > > lock > > > > A process trying to modify (defragment) the queue would try to > get > > it > > > > as a > > > > 'write' lock. > > > > If there are any readers holding the lock the writer stalls > waiting > > > for > > > > them > > > > If there is a process that has indicated that it wants the write > > > lock, > > > > new > > > > readers block waiting for the writer > > > > > > > > as long as the defragmentation runs are fairly rare compared to > > > normal > > > > operation this will be very efficient. > > > > > > > > But I suspect that there is not really a need to do the > > > defragmentation > > > > like this, just accept that some space may be wasted until the > > oldest > > > > messages finish being processed. > > > > > > > > > So this seems to be the culprit, at least as far as v5 is > involved. > > > For > > > > v4, I > > > > > am pretty sure I will get a totally different picture, because > > > there > > > > is > > > > no > > > > > such coarse locking and the string generation can be run in > > > parallel > > > > (but > > > > > there we have lots of locking activity and contention). V5 > seems > > to > > > > be > > > > so > > > > > much faster that this effect did not really surface. > > > > > > > > > > So what is the cure? 
An easy way would be to generate the > strings > > > for > > > > the > > > > > entire batch before beginning to process it. However, there is > a > > > > reason > > > > > this > > > > > so far is not done: especially with large batches, that would > > > require > > > > a large > > > > > amount of memory. For example, let's say we have a batch of > 1,000 > > > > messages > > > > > and each string is 200 bytes long. So we would need to use > > 200,000 > > > > bytes, > > > > > or > > > > > roughly 200k to hold just the strings, where we currently use > > only > > > > 200 > > > > (but > > > > > repeatedly overwrite it). That amount probably has a big impact > on > > > the > > > > cache > > > > > hit ratio: at worst, the first messages will be evicted from > > cache > > > by > > > > the > > > > > later messages. And as a batch is being processed from beginning to > > > end, > > > > each > > > > > message will evict some strings that we will soon need. The end > > > > result > > > > > could > > > > > be really bad. For smaller batch sizes, of course, that would > not > > > be > > > > as > > > > > much > > > > > of a problem. > > > > > > > > I'm not sure how much of a problem this would be. It's not like > you > > > > will > > > > be repeatedly accessing the same strings, it's just that by the > > time > > > > you > > > > get to the end of the string the beginning may no longer be in > the > > > > cache. > > > > > > > > this is inefficient from the cache point of view, but it may > still > > be > > > a > > > > win overall if there is a high transaction overhead in the > output. 
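The trade-off described above, rendering every string up front at a memory cost of roughly batch size times average string length, could look like the following sketch. This is hypothetical code: renderMsg() is a stand-in for the real template engine, and none of these names exist in rsyslog:

```c
/* Hypothetical sketch of pre-generating all output strings for a batch
 * before entering the locked/transactional section. Memory cost is
 * roughly nMsgs * average string length (e.g. 1,000 msgs * 200 bytes
 * = ~200 KB). renderMsg() stands in for the real template engine. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy "template": prefix the raw message with a priority tag. */
static char *
renderMsg(const char *rawMsg)
{
    size_t len = strlen(rawMsg) + sizeof("<13> ");
    char *out = malloc(len);
    snprintf(out, len, "<13> %s", rawMsg);
    return out;
}

/* Render the whole batch up front; the caller frees each string. */
static char **
renderBatch(const char **rawMsgs, int nMsgs)
{
    char **out = malloc((size_t)nMsgs * sizeof(char *));
    for (int i = 0; i < nMsgs; ++i)
        out[i] = renderMsg(rawMsgs[i]);
    return out;
}
```

The point of the design is that renderBatch() runs with no lock held at all; only the already-rendered strings are handed to the serialized action code.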
> > In > > > > addition, depending on the output type you may not end up having > to > > > > read > > > > the data back into the cache at all (for example, the system may > be > > > > able to > > > > DMA the data directly to the output device, network card or disk > > > > controller) > > > > > > > > the number of messages to try and batch is already tunable, if > it's > > > set > > > > low you use small strings which fit in cache, if large you don't > > fit > > > in > > > > cache, but may save elsewhere. > > > > > > Yup, and that's probably the next thing I'll look at (in code not yet > > > meant to > > > run in practice, so I may hardcode some things, just to see their > > > effect). > > > > > > > > > > > > The other (partial) solution that immediately came to my mind > was > > > to > > > > check > > > > > /be able to configure whether we really *need* (or should use) > > > > transactions. > > > > If > > > > > not, there is no problem in interleaving messages and so we > would > > > not > > > > need > > > > > to > > > > > make sure the batch is being processed atomically and thus we > > could > > > > release > > > > > the lock while generating messages -- much like v4 does. > HOWEVER, > > > > that > > > > also > > > > > means we have much more locking activity, and so lock > contention > > > > begins > > > > to > > > > > become a concern again (somehow going in circles here). > > > > > > > > > > Maybe it would be good to have both capabilities and let the > > > operator > > > > > decide > > > > > which one suits best. For starters, I will probably just try > to > > > > change > > > > the > > > > > string generation phase to use a big buffer and place that > > outside > > > of > > > > the > > > > > atomic transaction code. While I have quite some concerns, > > outlined > > > > above, > > > > > I > > > > > find it interesting to see the effect this has on the overall > > > > performance > > > > > (when used with reasonable batch sizes). 
This may be useful as > a > > > > guideline > > > > > for further work. > > > > > > > > we already allow the transaction size to be configured down to 1 > > > > (effectively disabled) so there's no need for a tunable there. > > > > > > > > But if the string generation is being done while the lock is > being > > > held > > > > it > > > > should definitely be moved outside of the lock. > > > > > > I agree, but remember the problems I saw in my first post. Anyhow, > > we > > > are > > > now looking at them. I also think it makes a lot of sense to > provide > > > different code paths for plugins that actually need a > transactional > > > interface (like database outputs) and those that are not actually > > > transactional (like file output). The latter code path can contain > > some > > > optimizations over the former. > > > > > > Rainer > > > > > > > > David Lang > > > > > > > > > As usual, comments are greatly appreciated. > > > > > > > > > > Rainer > > > > > > > > > >> -----Original Message----- > > > > >> From: [email protected] [mailto:rsyslog- > > > > >> [email protected]] On Behalf Of Aaron Wiebe > > > > >> Sent: Monday, June 07, 2010 7:17 PM > > > > >> To: rsyslog-users > > > > >> Subject: Re: [rsyslog] discussion request: performance > > enhancement > > > > for > > > > >> imtcp > > > > >> > > > > >> Greetings, > > > > >> > > > > >> >> hmm, this could be locking overhead as well. One thing that > > you > > > > did > > > > >> >> early > > > > >> >> in v5 (I don't think it made it into v4) was to allow the > UDP > > > > >> receiver > > > > >> >> to > > > > >> >> insert multiple messages into the queue at once. That made > a > > > huge > > > > >> >> difference. > > > > >> > > > > > >> > No, I think that was something I did to both versions. At > some > > > > time, > > > > >> I did > > > > >> > optimizations to both v4 and v5, things like reducing > copies, > > > > >> reducing malloc > > > > >> > calls and so on. 
I am pretty sure submission batching was > > among > > > > them. > > > > >> > > > > >> I agree with David actually. While multiple tcp threads on > the > > > > input > > > > >> side certainly would be helpful, I believe the locking > overhead > > is > > > > >> likely the real culprit behind the inability to fully utilize > a > > > > >> multi-core machine with a single instance of rsyslog. In my > > > > >> experience, while the input thread was certainly relatively > > busy, > > > > the > > > > >> thread itself wasn't hitting a cpu bottleneck. Reducing some > of > > > the > > > > >> latencies around queuing and context switching is probably the > > > best > > > > >> place to spend time if the goal is improved performance. The > > > > earlier > > > > >> investigations into lockless queues combined with some > batching > > > may > > > > >> help to address these. As it stands, I don't regularly see > > > specific > > > > >> threads hitting cpu bottlenecks (assuming top -H is accurate). > > > > >> > > > > >> Also, if that is the problem (queues and context switching), > > > adding > > > > >> further division of work into imtcp may actually make the > > problem > > > > >> worse. That said, I'm not against reducing possible > bottlenecks > > > to > > > > >> get into the 1-10 gig input levels (at which this would > probably > > > > >> become an issue) - but I think the queues should be more > closely > > > > >> examined first. 
> > > > >> > > > > >> -Aaron > > > > >> _______________________________________________ > > > > >> rsyslog mailing list > > > > >> http://lists.adiscon.net/mailman/listinfo/rsyslog > > > > >> http://www.rsyslog.com

