OK, *very* interesting: I have now partially analyzed the control flow as it stands after all the changes in v4 and v5. Over the years, more and more features went into rsyslog, most of them used in only a minority of cases. Plus, there is "last message repeated n times", which I have elaborated on quite often already (and which I really dislike).
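The idea of serializing only when a selected feature actually demands it can be sketched as follows. This is an illustrative Python simulation, not rsyslog code; all names (`Action`, `rate_limit`, `suppress_repeats`, the "firehose" fast path) are hypothetical:

```python
import threading

class Action:
    def __init__(self, rate_limit=None, suppress_repeats=False):
        # Features that keep shared state (message counts, repeat history)
        # force serialization of the first processing stage.
        self.needs_serialization = rate_limit is not None or suppress_repeats
        self.lock = threading.Lock()
        self.processed = 0

    def _do_process(self, batch):
        self.processed += len(batch)

    def process(self, batch):
        if self.needs_serialization:
            # Slow path: counting/suppression state must be updated atomically.
            with self.lock:
                self._do_process(batch)
        else:
            # "Firehose" fast path: no shared feature state, so this stage
            # needs no lock (the per-action output lock is a separate issue).
            self._do_process(batch)

fast = Action()                    # firehose mode: no esoteric features
slow = Action(rate_limit=(10, 5))  # e.g. at most 10 executions per 5 seconds
fast.process(["m1", "m2"])
slow.process(["m3"])
```

The point is only the branch: the default configuration would select the lock-free path, and the long, complex serialized path runs only when a feature requires it.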
I have now checked, and most of these features require serialization inside the first stage of action processing. For example, rate limiting features like "do not execute this action more than 10 times within 5 seconds" require a proper count of messages, and so do others. The end result is that this stage needs to be guarded by a lock, and it is. So during that processing, no concurrency is possible. Also, the code path has become rather long and complex. Now, these features are there for a good reason. It doesn't make sense to remove them just to get more performance. However, I now plan to partition the code and serialize things only when selected features actually demand it. I have created a version that hardcodes a "firehose" mode where none of these esoteric features are used (and "last message ..." is also disabled). It immediately more than doubled the throughput, without even a change to the action lock! I am not sure if this is a realistic speedup (as I created a very specific environment), but it clearly shows there is value in this approach. So I will now see if I can create some code that enables a "firehose mode" if the action parameters permit it (and the default action parameters, I think, should do so).

Rainer

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Rainer Gerhards
> Sent: Tuesday, June 08, 2010 9:16 AM
> To: [email protected]; rsyslog-users
> Subject: Re: [rsyslog] discussion request: performance enhancement for imtcp
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: Monday, June 07, 2010 11:49 PM
> > To: rsyslog-users
> > Cc: Rainer Gerhards
> > Subject: Re: [rsyslog] discussion request: performance enhancement for imtcp
> >
> > First off, a comment that this may not make it through to the list. I'm having to use webmail to access my home e-mail from work, and I got bounced earlier today.
> >
> > On Mon, 7 Jun 2010 22:11:37 +0200, "Rainer Gerhards" <[email protected]> wrote:
> > > Aaron, David,
> > >
> > > thanks again for the feedback. I did some testing the past few hours, but all of it on v5, so the numbers may be quite different for v4.
> > >
> > > Based on what I have seen, the issue seems not to be lock contention, but rather the locking of quite long code sequences. In v5, we introduced batching, but we also (somewhat later) introduced transactions. A transaction, however, is atomic, and so rsyslog must make sure that no two transactions are mixed together. To do so, the engine applies the action lock when the batch begins to be processed and releases it when the batch is completely processed. During that processing, not only is the action being called (which would be fine), but the to-be-passed strings are also generated. However, if there is a quick filter "in front of the action" (as *.* is!), this is the majority of the work being carried out. So in essence, due to transaction support, there cannot be more than one worker thread in that large code sequence. I have tested on a quad-core machine, and I can't get more than 1.3 CPUs utilized when I have a single input and a single rule
> > >
> > >     *.* /path/to/file
> > >
> > > I have then *completely* removed omfile processing (by commenting out everything in doAction), and I still can't get more than 1.5 CPUs utilized, all of this with 4 worker threads defined (the imtcp input uses around 10 to 15% of a single CPU).
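The coarse sequence described above -- template string generation and the action call both serialized under one action lock -- can be contrasted with a split approach in a short sketch (illustrative Python, not actual rsyslog code; names are invented):

```python
import threading

action_lock = threading.Lock()

def render(msg, template="%s\n"):
    # Template/string generation: pure CPU work with no shared state,
    # yet in the v5 scheme it runs under the action lock.
    return template % msg

def process_batch_coarse(batch, out):
    # v5 behavior as described: the whole batch, including string
    # generation, is serialized to keep the transaction atomic.
    with action_lock:
        for msg in batch:
            out.append(render(msg))

def process_batch_split(batch, out):
    # Alternative: render outside the lock, hold the lock only for
    # the atomic output/commit part.
    rendered = [render(msg) for msg in batch]
    with action_lock:
        out.extend(rendered)

sink = []
process_batch_coarse(["a", "b"], sink)
process_batch_split(["c"], sink)
```

In the coarse variant, only one worker can make progress at a time even though most of the work (rendering) is embarrassingly parallel; the split variant shrinks the serialized region to the output itself.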
> >
> > could this be worked around by using multiple rulesets so that the main queue is not locked as long (or does this just result in the secondary queues getting blocked)?
>
> The latter is the case: queue processing/output is not fast enough.
>
> Let me differentiate output, that is, the action, from (queue) processing here. This is important because when I hear "output", I always constrain my thinking to the actual output part, minus other internal processing like filters and the like. A core question is whether the interim processing or the actual output takes most of the time.
>
> > I thought that the transaction support worked like this (which would not cause the problem you are describing):
> >
> >     lock queue
> >     mark N messages as being worked on
> >     unlock queue
> >     process messages
> >     lock queue
> >     if successful
> >         mark the messages tagged above as completed (and therefore
> >         available to be removed, which may remove them)
> >     else
> >         mark the messages tagged above as not processed
> >     unlock queue
> >
> > you don't need to hold the lock while processing the records, only when tagging them (to make sure another thread doesn't tag them as well)
>
> From the POV of the queue this is correct and how things are done...
>
> > from your description of the process it sounds like what is happening is
> >
> >     lock queue
> >     mark N messages as being worked on
> >     create output strings for N messages
> >     unlock queue
> >     send messages
> >     lock queue
> >     if successful
> >         mark the messages tagged above as completed (and therefore
> >         available to be removed, which may remove them)
> >     else
> >         mark the messages tagged above as not processed
> >     unlock queue
> >
> > the one thing you need to watch out for here is that you don't move messages around while they are live (which will cause a little bit of fragmentation)
>
> This is not correct. But you miss the action part.
>
> Remember that the output plugin interface specifies that only one thread may be inside an action at any time. This was introduced to facilitate writing output plugins. In theory, an output plugin can request to be called concurrently, but this is not yet implemented. So we need to hold on to the action lock (NOT the queue lock) whenever we call an action.
>
> Even more, transactions mean that we must not interleave two or more batches. Let's say we had two batches A and B, each with 4 messages. Then calling the output as follows:
>
>     Abegin
>     Bbegin
>     A1
>     A2
>     B1
>     A3
>     B2
>     A4
>     Acommit
>     B3
>     B4
>     Bcommit
>
> would mean that at Acommit, messages A1,...,A4,B1,B2 would be committed. This could be worked around by far more complex output plugins. These would then need to not only support concurrency but also keep separate objects/connections for the various threads. This, if at all, makes sense only for database plugins. I don't see that the added overhead would make any sense for things like the file writer.
>
> But as we have already discussed, it is not so easy to keep the file writer problem-free in that case as well -- because it may get interrupted during writes (which means we need a lock, even if we manage to permit more concurrency inside the file writer).
>
> So in essence, the area to look at is how we can restructure the output plugin interface in regard to its transaction support. I am currently looking at this area and have done some preliminary testing. My main concern at this time is to find those spots that actually are the primary bottlenecks (at this time, hopefully moving the border forward ;)). The past hours I thankfully was able to get some base results and match them with what I expect. At some other places, the results surprise me a bit.
> This is not unexpected -- I had no time to touch that code (from a performance point of view) for roughly a year, so I need to gain some new understanding. Also, the code has evolved, and it may be possible to refactor it into something simpler (which is good for both performance and maintainability).
>
> As one of the next things, I will probably try the "big memory, off sync" string generation, just to see the effects (it is rather complicated to get that in cleanly, because there was so much optimization in v4 on cache hit efficiency, parts of which must be undone). Along the way, I will also analyze the calling structure and search for simplifications.
>
> And, as usual, your feedback is very helpful and appreciated. Good questions often lead to good thinking and analysis and thus to things I would not have thought about without them ;)
>
> As a side note, I already identified one regression that caused locks to be applied too often when messages were discarded. I have already committed a fix for that to the master branch.
>
> > If this is a problem, you can create a separate lock for modifying existing messages.
> >
> > The locking rules would be something like:
> >
> > Allow an unlimited number of worker threads to hold it as a 'read' lock.
> > A process trying to modify (defragment) the queue would try to get it as a 'write' lock.
> > If there are any readers holding the lock, the writer stalls waiting for them.
> > If there is a process that has indicated that it wants the write lock, new readers block waiting for the writer.
> >
> > As long as defragmentation runs are fairly rare compared to normal operation, this will be very efficient.
> >
> > But I suspect that there is not really a need to do the defragmentation like this; just accept that some space may be wasted until the oldest messages finish being processed.
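The four locking rules David lists map directly onto a writer-preferring read/write lock. A minimal sketch (illustrative Python; POSIX `pthread_rwlock_t` provides the same semantics natively in C):

```python
import threading

class RWLock:
    # Writer-preferring read/write lock: unlimited concurrent readers;
    # a writer waits for readers to drain; once a writer is waiting,
    # new readers block until the writer has run.
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writers_waiting = 0
        self._writer_active = False

    def acquire_read(self):
        with self._cond:
            while self._writer_active or self._writers_waiting:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._writers_waiting += 1
            while self._readers or self._writer_active:
                self._cond.wait()
            self._writers_waiting -= 1
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()

lk = RWLock()
lk.acquire_read(); lk.release_read()    # worker threads take the read side
lk.acquire_write(); lk.release_write()  # a defragmentation pass takes the write side
```

As David notes, this is only worthwhile if write (defragmentation) passes are rare; otherwise the writer preference starves readers.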
> >
> > > So this seems to be the culprit, at least as far as v5 is involved. For v4, I am pretty sure I will get a totally different picture, because there is no such coarse locking and the string generation can run in parallel (but there we have lots of locking activity and contention). V5 seems to be so much faster that this effect did not really surface.
> > >
> > > So what is the cure? An easy way would be to generate the strings for the entire batch before beginning to process it. However, there is a reason this is not done so far: especially with large batches, it would require a large amount of memory. For example, let's say we have a batch of 1,000 messages and each string is 200 bytes long. We would then need 200,000 bytes, or roughly 200k, to hold just the strings, where we currently use only 200 (but repeatedly overwrite them). That amount probably has a big impact on the cache hit ratio: at worst, the first messages will be evicted from cache by the later messages. And as a batch is processed from beginning to end, each message will evict some strings that we will soon need. The end result could be really bad. For smaller batch sizes, of course, that would not be as much of a problem.
> >
> > I'm not sure how much of a problem this would be. It's not like you will be repeatedly accessing the same strings; it's just that by the time you get to the end of the strings, the beginning may no longer be in the cache.
> >
> > This is inefficient from the cache point of view, but it may still be a win overall if there is a high transaction overhead in the output.
> > In addition, depending on the output type, you may not end up having to read the data back into the cache at all (for example, the system may be able to DMA the data directly to the output device, network card or disk controller).
> >
> > The number of messages to try and batch is already tunable; if it's set low, you use small strings which fit in cache; if large, you don't fit in cache but may save elsewhere.
>
> Jup, and that's probably the next thing I'll look at (in code not yet meant to run in practice, so I may hardcode some things, just to see their effect).
>
> > > The other (partial) solution that immediately came to my mind was to check/be able to configure whether we really *need* (or should use) transactions. If not, there is no problem in interleaving messages, so we would not need to make sure the batch is processed atomically, and thus we could release the lock while generating messages -- much like v4 does. HOWEVER, that also means we have much more locking activity, and so lock contention begins to become a concern again (somehow going in circles here).
> > >
> > > Maybe it would be good to have both capabilities and let the operator decide which one suits best. For starters, I will probably just try to change the string generation phase to use a big buffer and place that outside of the atomic transaction code. While I have quite some concerns, outlined above, I find it interesting to see the effect this has on the overall performance (when used with reasonable batch sizes). This may be useful as a guideline for further work.
> >
> > we already allow the transaction size to be configured down to 1 (effectively disabled), so there's no need for a tunable there.
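The memory trade-off discussed above comes down to simple arithmetic, which the batch-size tunable controls directly. A tiny helper (hypothetical, for illustration only) makes the numbers concrete:

```python
def batch_buffer_bytes(batch_size, avg_string_len):
    # Bytes needed to pre-render every string of a batch up front,
    # compared with a single reused buffer of avg_string_len bytes.
    return batch_size * avg_string_len

# The figures from the discussion: a batch of 1,000 messages with
# 200-byte strings needs ~200k of buffer, versus one reused 200-byte
# buffer today. Whether that working set exceeds the cache decides
# how much eviction the pre-render approach causes.
big = batch_buffer_bytes(1000, 200)
```

Shrinking the batch size shrinks the pre-rendered working set proportionally, which is why a small configured batch can keep the strings cache-resident.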
> >
> > But if the string generation is being done while the lock is being held, it should definitely be moved outside of the lock.
>
> I agree, but remember the problems I saw in my first post. Anyhow, we are now looking at them. I also think it makes a lot of sense to provide different code paths for plugins that actually need a transactional interface (like database outputs) and those that are not actually transactional (like file output). The latter code path can contain some optimizations over the former.
>
> Rainer
>
> > David Lang
> >
> > > As usual, comments are greatly appreciated.
> > >
> > > Rainer
> > >
> > >> -----Original Message-----
> > >> From: [email protected] [mailto:[email protected]] On Behalf Of Aaron Wiebe
> > >> Sent: Monday, June 07, 2010 7:17 PM
> > >> To: rsyslog-users
> > >> Subject: Re: [rsyslog] discussion request: performance enhancement for imtcp
> > >>
> > >> Greetings,
> > >>
> > >> >> hmm, this could be locking overhead as well. One thing that you did early in v5 (I don't think it made it into v4) was to allow the UDP receiver to insert multiple messages into the queue at once. That made a huge difference.
> > >> >
> > >> > No, I think that was something I did to both versions. At some time, I did optimizations to both v4 and v5, things like reducing copies, reducing malloc calls and so on. I am pretty sure submission batching was among them.
> > >>
> > >> I agree with David, actually. While multiple tcp threads on the input side certainly would be helpful, I believe the locking overhead is likely the real culprit behind the inability to fully utilize a multi-core machine with a single instance of rsyslog. In my experience, while the input thread was certainly relatively busy, the thread itself wasn't hitting a cpu bottleneck.
> > >> Reducing some of the latencies around queuing and context switching is probably the best place to spend time if the goal is improved performance. The earlier investigations into lockless queues combined with some batching may help to address these. As it stands, I don't regularly see specific threads hitting cpu bottlenecks (assuming top -H is accurate).
> > >>
> > >> Also, if that is the problem (queues and context switching), adding further division of work into imtcp may actually make the problem worse. That said, I'm not against reducing possible bottlenecks to get into the 1-10 gig input levels (at which point this would probably become an issue) -- but I think the queues should be more closely examined first.
> > >>
> > >> -Aaron
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

