> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Monday, June 07, 2010 11:49 PM
> To: rsyslog-users
> Cc: Rainer Gerhards
> Subject: Re: [rsyslog] discussion request: performance enhancement for
> imtcp
> 
> First off, a comment that this may not make it through to the list.
> I'm having to use webmail to access my home e-mail from work and I got
> bounced earlier today.
> 
> On Mon, 7 Jun 2010 22:11:37 +0200, "Rainer Gerhards"
> <[email protected]> wrote:
> > Aaron, David,
> >
> > thanks again for the feedback. I did some testing the past few hours,
> > but all of them on v5, so the numbers may be quite different for v4.
> >
> > Based on what I have seen, the issue seems not to be lock contention,
> > but rather locking quite long code sequences. In v5, we introduced
> > batching, but we also (somewhat later) introduced transactions. A
> > transaction, however, is atomic, and so rsyslog must make sure that
> > no two transactions are mixed together. To do so, the engine applies
> > the action lock when the batch is being processed and releases it
> > when the batch is completely processed. During that processing, not
> > only is the action being called (which would be fine), but the
> > to-be-passed strings are also generated. However, if there is a quick
> > filter "in front of the action" (as *.* is!), this is the majority of
> > the work being carried out. So in essence, due to transaction
> > support, there cannot be more than one worker thread in that large
> > code sequence. I have tested on a quad-core machine, and I can't get
> > more than 1.3 CPUs utilized when I have a single input and a single
> > rule:
> >
> > *.* /path/to/file
> >
> > I have then *completely* removed omfile processing (by commenting out
> > everything in doAction), and I still can't get more than 1.5 CPUs
> > utilized, all of this with 4 worker threads defined (the imtcp input
> > uses around 10 to 15% of a single CPU).
> 
> could this be worked around by using multiple rulesets so that the
> main queue is not locked as long (or does this just result in the
> secondary queues getting blocked)?

The latter is the case: queue processing/output is not fast enough.

Let me differentiate the output, that is, the action, from (queue) processing
here. This is important because when I hear "output", I always constrain my
thinking to the actual output part minus other internal processing like
filters and the like. A core question is whether the interim processing or
the actual output takes most of the time.

> I thought that the transaction support worked like this (which would
> not cause the problem you are describing):
> 
> lock queue
> mark N messages as being worked on
> unlock queue
> process messages
> lock queue
> if successful
>   mark the messages tagged above as being completed (and therefore
>   available to be removed, which may remove them)
> else
>   mark the messages tagged above as not processed
> unlock queue
> 
> you don't need to hold the lock while processing the records, only
> when tagging them (to make sure another thread doesn't tag them as
> well)

From the POV of the queue this is correct and how things are done...

> 
> from your description of the process, it sounds like what is happening
> is
> 
> lock queue
> mark N messages as being worked on
> create output strings for N messages
> unlock queue
> send messages
> lock queue
> if successful
>   mark the messages tagged above as being completed (and therefore
>   available to be removed, which may remove them)
> else
>   mark the messages tagged above as not processed
> unlock queue
> 
> the one thing you need to watch out for here is that you don't move
> messages around while they are live (which will cause a little bit of
> fragmentation)

This is not correct. But you are missing the action part.

Remember that the output plugin interface specifies that only one thread may
be inside an action at any given time. This was introduced to facilitate
writing output plugins. In theory, an output plugin can request to be called
concurrently, but this is not yet implemented. So we need to hold on to the
action lock (NOT the queue lock) whenever we call an action.

Even more, transactions mean that we must not interleave two or more batches.
Let's say we had two batches A and B, each with 4 messages. Then calling the
output as follows:

Abegin
Bbegin
A1
A2
B1
A3
B2
A4
Acommit
B3
B4
Bcommit

would mean that at Acommit, messages A1,...,A4,B1,B2 would be committed. This
could be worked around by far more complex output plugins. These would then
need to not only support concurrency but also keep separate
objects/connections for the various threads. This, if at all, makes sense
only for database plugins. I don't see how the added overhead would make any
sense at all for things like the file writer.

But as we have already discussed, it is not so easy to keep the file writer
problem-free in that case as well -- because it may get interrupted during
writes (which means we need a lock, even if we manage to permit more
concurrency inside the file writer).

So in essence, the area to look at is whether we can restructure the output
plugin interface in regard to its transaction support. I am currently looking
at this area and have done some preliminary testing. My main concern at this
time is to find those spots that actually are the primary bottlenecks (at
this time, hopefully moving the border forward ;)). Over the past hours I
thankfully was able to get some base results and match them with what I
expect. At some other places, the results surprise me a bit. This is not
unexpected -- I had no time to touch that code (from a performance point of
view) for roughly a year, so I need to gain some new understanding. Also, the
code has evolved, and it may be possible to refactor it into something
simpler (which is good for both performance and maintainability).

As one of the next things, I will probably try the "big memory, off sync"
string generation, just to see the effects (it is rather complicated to get
that in cleanly, because there was so much optimization in v4 on cache hit
efficiency, parts of which must be undone). Along the way, I will also
analyze the calling structure and search for simplifications.

And, as usual, your feedback is very helpful and appreciated. Good questions
often lead to good thinking and analysis, and thus lead to things I would not
have thought about without them ;)

As a side note, I already identified one regression that caused locks to be
applied too often when messages were discarded. I have already committed a
fix for that to the master branch.
> 
> If this is a problem you can create a separate lock for modifying
> existing messages.
> 
> the locking rules would be something like:
> 
> Allow an unlimited number of worker threads to hold it as a 'read'
> lock.
> A process trying to modify (defragment) the queue would try to get it
> as a 'write' lock.
> If there are any readers holding the lock, the writer stalls waiting
> for them.
> If there is a process that has indicated that it wants the write lock,
> new readers block waiting for the writer.
> 
> as long as the defragmentation runs are fairly rare compared to normal
> operation this will be very efficient.
> 
> But I suspect that there is not really a need to do the defragmentation
> like this, just accept that some space may be wasted until the oldest
> messages finish being processed.
> 
> > So this seems to be the culprit, at least as far as v5 is involved.
> > For v4, I am pretty sure I will get a totally different picture,
> > because there is no such coarse locking and the string generation
> > can be run in parallel (but there we have lots of locking activity
> > and contention). V5 seems to be so much faster that this effect did
> > not really surface.
> >
> > So what is the cure? An easy way would be to generate the strings
> > for the entire batch before beginning to process it. However, there
> > is a reason this so far is not done: especially with large batches,
> > that would require a large amount of memory. For example, let's say
> > we have a batch of 1,000 messages and each string is 200 bytes long.
> > So we would need to use 200,000 bytes, or roughly 200k, to hold just
> > the strings, where we currently use only 200 (but repeatedly
> > overwrite them). That amount probably has a big impact on the cache
> > hit ratio: at worst, the first messages will be evicted from cache
> > by the later messages. And as a batch is being processed from begin
> > to end, each message will evict some strings that we will soon need.
> > The end result could be really bad. For smaller batch sizes, of
> > course, that would not be as much of a problem.
> 
> I'm not sure how much of a problem this would be. it's not like you
> will be repeatedly accessing the same strings, it's just that by the
> time you get to the end of the string the beginning may no longer be
> in the cache.
> 
> this is inefficient from the cache point of view, but it may still be
> a win overall if there is a high transaction overhead in the output.
> In addition, depending on the output type you may not end up having
> to read the data back into the cache at all (for example, the system
> may be able to DMA the data directly to the output device, network
> card or disk controller)
> 
> the number of messages to try and batch is already tunable; if it's
> set low you use small strings which fit in cache, if large you don't
> fit in cache, but may save elsewhere.

Yup, and that's probably the next thing I will look at (in code not yet meant
to run in practice, so I may hardcode some things, just to see their effect).

> 
> > The other (partial) solution that immediately came to my mind was
> > to check/be able to configure whether we really *need* (or should
> > use) transactions. If not, there is no problem in interleaving
> > messages, and so we would not need to make sure the batch is being
> > processed atomically, and thus we could release the lock while
> > generating messages -- much like v4 does. HOWEVER, that also means
> > we have much more locking activity, and so lock contention begins
> > to become a concern again (somehow going in circles, here).
> >
> > Maybe it would be good to have both capabilities and let the
> > operator decide which one suits best. For starters, I will probably
> > just try to change the string generation phase to use a big buffer
> > and place that outside of the atomic transaction code. While I have
> > quite some concerns, outlined above, I find it interesting to see
> > the effect this has on the overall performance (when used with
> > reasonable batch sizes). This may be useful as a guideline for
> > further work.
> 
> we already allow the transaction size to be configured down to 1
> (effectively disabled), so there's no need for a tunable there.
> 
> But if the string generation is being done while the lock is being
> held, it should definitely be moved outside of the lock.

I agree, but remember the problems I described in my first post. Anyhow, we
are now looking at them. I also think it makes a lot of sense to provide
different code paths for plugins that actually need a transactional
interface (like database outputs) and those that are not actually
transactional (like file output). The latter code path can contain some
optimizations over the former.

Rainer
> 
> David Lang
> 
> > As usual, comments are greatly appreciated.
> >
> > Rainer
> >
> >> -----Original Message-----
> >> From: [email protected] [mailto:rsyslog-
> >> [email protected]] On Behalf Of Aaron Wiebe
> >> Sent: Monday, June 07, 2010 7:17 PM
> >> To: rsyslog-users
> >> Subject: Re: [rsyslog] discussion request: performance enhancement
> >> for imtcp
> >>
> >> Greetings,
> >>
> >> >> hmm, this could be locking overhead as well. One thing that you
> >> >> did early in v5 (I don't think it made it into v4) was to allow
> >> >> the UDP receiver to insert multiple messages into the queue at
> >> >> once. That made a huge difference.
> >> >
> >> > No, I think that was something I did to both versions. At some
> >> > time, I did optimizations to both v4 and v5, things like reducing
> >> > copies, reducing malloc calls and so on. I am pretty sure
> >> > submission batching was among them.
> >>
> >> I agree with David actually.  While multiple tcp threads on the
> >> input side certainly would be helpful, I believe the locking
> >> overhead is likely the real culprit behind the inability to fully
> >> utilize a multi-core machine with a single instance of rsyslog.  In
> >> my experience, while the input thread was certainly relatively
> >> busy, the thread itself wasn't hitting a cpu bottleneck.  Reducing
> >> some of the latencies around queuing and context switching is
> >> probably the best place to spend time if the goal is improved
> >> performance.  The earlier investigations into lockless queues
> >> combined with some batching may help to address these.  As it
> >> stands, I don't regularly see specific threads hitting cpu
> >> bottlenecks (assuming top -H is accurate).
> >>
> >> Also, if that is the problem (queues and context switching), adding
> >> further division of work into imtcp may actually make the problem
> >> worse.  That said, I'm not against reducing possible bottlenecks to
> >> get into the 1-10 gig input levels (at which this would probably
> >> become an issue) - but I think the queues should be more closely
> >> examined first.
> >>
> >> -Aaron
> >> _______________________________________________
> >> rsyslog mailing list
> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >> http://www.rsyslog.com
