David,

thanks for the comments, they are quite useful. A real answer will follow,
but I am currently running a test and would like to see it completed. I am
creating a very special version (not for actual use) that removes the
action locking for omfile. As a test, that should be fairly easy (but there
*are* some subtleties), and it would provide us with some good numbers to
work on.

Just one thing I'd like to verify: as I understand it, once a write()
request has been issued, the OS will make sure that this write is handled
in an atomic manner, even if another thread issues another write for the
same fd in parallel and thread one's timeslice is exhausted. I am pretty
sure this is the case, but as it becomes fundamental, I'd like to get some
confirmation from others.
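As a quick experiment (not a proof!) of the question above: two threads write fixed-size records to the same fd opened with O_APPEND. POSIX guarantees the offset update and the write are atomic with O_APPEND; whether the *data* of two concurrent write()s can interleave is exactly the point to confirm. On Linux, small writes to a regular file do not interleave in practice. The file name and record size below are illustrative.

```python
import os
import threading

PATH = "/tmp/write-atomicity-test.log"   # illustrative path
RECORD_LEN = 64
COUNT = 2000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_APPEND, 0o644)

def writer(marker):
    # one record per write() call, so atomicity is per record
    record = (marker * (RECORD_LEN - 1) + "\n").encode()
    for _ in range(COUNT):
        os.write(fd, record)

threads = [threading.Thread(target=writer, args=(m,)) for m in "AB"]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)

# If write() is atomic per call, every line is homogeneous: all 'A's
# or all 'B's, never a mix of the two.
with open(PATH, "rb") as f:
    lines = f.read().splitlines()
mixed = [ln for ln in lines if len(set(ln)) != 1]
print(len(lines), len(mixed))
```

If `mixed` stays empty under load, that is at least strong evidence for the assumption on that particular kernel and filesystem.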

Thanks,
Rainer 

> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Monday, June 07, 2010 11:49 PM
> To: rsyslog-users
> Cc: Rainer Gerhards
> Subject: Re: [rsyslog] discussion request: performance enhancement for
> imtcp
> 
> First off a comment that this may not make it through to the list. I'm
> having to use webmail to access my home e-mail from work and I got
> bounced
> earlier today.
> 
> On Mon, 7 Jun 2010 22:11:37 +0200, "Rainer Gerhards"
> <[email protected]> wrote:
> > Aaron, David,
> >
> > thanks again for the feedback. I did some testing the past few
> > hours, but all of it on v5, so the numbers may be quite different
> > for v4.
> >
> > Based on what I have seen, the issue seems not to be lock
> > contention, but rather the locking of quite long code sequences. In
> > v5, we introduced batching, but we also (somewhat later) introduced
> > transactions. A transaction, however, is atomic, and so rsyslog must
> > make sure that no two transactions are mixed together. To do so, the
> > engine acquires the action lock when the batch starts being
> > processed and releases it when the batch is completely processed.
> > During that processing, not only is the action called (which would
> > be fine), but the to-be-passed strings are also generated. However,
> > if there is a quick filter "in front of the action" (as *.* is!),
> > this is the majority of the work being carried out. So in essence,
> > due to transaction support, there cannot be more than one worker
> > thread in that large code sequence. I have tested on a quad-core
> > machine, and I can't get more than 1.3 CPUs utilized when I have a
> > single input and a single rule
> >
> > *.* /path/to/file
> >
> > I have then *completely* removed omfile processing (by commenting
> > out everything in doAction), and I still can't get more than 1.5
> > CPUs utilized, all of this with 4 worker threads defined (the imtcp
> > input uses around 10 to 15% of a single CPU).
> 
> could this be worked around by using multiple rulesets so that the main
> queue is not locked as long (or does this just result in the secondary
> queues getting blocked)?
> 
> I thought that the transaction support worked like this (which would
> not cause the problem you are describing):
> 
> lock queue
> mark N messages as being worked on
> unlock queue
> process messages
> lock queue
> if successful
>   mark the messages tagged above as being completed (and therefore
>   available to be removed, which may remove them)
> else
>   mark the messages tagged above as not processed
> unlock queue
> 
> you don't need to hold the lock while processing the records, only when
> tagging them (to make sure another thread doesn't tag them as well)
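The tagging pattern David describes can be sketched roughly as follows (a minimal in-memory model; all names here are hypothetical, not rsyslog's actual internals). The queue lock is held only while claiming or completing messages, never while they are processed:

```python
import threading

class BatchQueue:
    def __init__(self, messages):
        self.lock = threading.Lock()
        self.pending = list(messages)   # not yet claimed
        self.in_flight = {}             # worker id -> claimed batch
        self.done = []                  # successfully processed

    def claim(self, worker_id, n):
        # "lock queue / mark N messages as being worked on / unlock queue"
        with self.lock:
            batch = self.pending[:n]
            del self.pending[:n]
            if batch:
                self.in_flight[worker_id] = batch
            return batch

    def complete(self, worker_id, success):
        # "lock queue / mark completed or not processed / unlock queue"
        with self.lock:
            batch = self.in_flight.pop(worker_id)
            if success:
                self.done.extend(batch)
            else:
                # failed: return the messages for another attempt
                self.pending = batch + self.pending

def worker(q, worker_id, results):
    while True:
        batch = q.claim(worker_id, 4)
        if not batch:
            return
        # "process messages" happens with NO queue lock held, so
        # several workers can run this part in parallel
        processed = [m.upper() for m in batch]
        results.extend(processed)   # stand-in for the real output action
        q.complete(worker_id, success=True)
```

The point of the sketch is that contention is limited to the short claim/complete critical sections.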
> 
> from your description of the process it sounds like what is happening
> is:
> 
> lock queue
> mark N messages as being worked on
> create output strings for N messages
> unlock queue
> send messages
> lock queue
> if successful
>   mark the messages tagged above as being completed (and therefore
>   available to be removed, which may remove them)
> else
>   mark the messages tagged above as not processed
> unlock queue
> 
> the one thing you need to watch out for here is that you don't move
> messages around while they are live (which will cause a little bit of
> fragmentation)
> 
> If this is a problem you can create a separate lock for modifying
> existing messages
> 
> the locking rules would be something like:
> 
> Allow an unlimited number of worker threads to hold it as a 'read' lock.
> A process trying to modify (defragment) the queue would try to get it
> as a 'write' lock.
> If there are any readers holding the lock, the writer stalls waiting
> for them.
> If there is a process that has indicated that it wants the write lock,
> new readers block waiting for the writer.
> 
> as long as the defragmentation runs are fairly rare compared to normal
> operation this will be very efficient.
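The rules above describe a writer-preferring read/write lock. In C this is what pthread_rwlock_t provides natively; as a sketch of the semantics (class and field names are hypothetical):

```python
import threading

class RWLock:
    """Writer-preferring read/write lock, per the rules above."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0            # readers currently holding the lock
        self._writer = False         # True while a writer holds the lock
        self._writers_waiting = 0    # queued writers: blocks NEW readers

    def acquire_read(self):
        with self._cond:
            # new readers wait if a writer holds or wants the lock
            while self._writer or self._writers_waiting:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # wake a waiting writer

    def acquire_write(self):
        with self._cond:
            self._writers_waiting += 1
            # the writer stalls until all current readers are done
            while self._writer or self._readers:
                self._cond.wait()
            self._writers_waiting -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

As the email notes, this is efficient exactly when writes (defragmentation runs) are rare: readers never contend with each other, only with the occasional writer.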
> 
> But I suspect that there is not really a need to do the defragmentation
> like this, just accept that some space may be wasted until the oldest
> messages finish being processed.
> 
> > So this seems to be the culprit, at least as far as v5 is concerned.
> > For v4, I am pretty sure I will get a totally different picture,
> > because there is no such coarse locking and the string generation can
> > run in parallel (but there we have lots of locking activity and
> > contention). V5 seems to be so much faster that this effect did not
> > really surface.
> >
> > So what is the cure? An easy way would be to generate the strings
> > for the entire batch before beginning to process it. However, there
> > is a reason this is not done so far: especially with large batches,
> > it would require a large amount of memory. For example, let's say we
> > have a batch of 1,000 messages and each string is 200 bytes long. So
> > we would need 200,000 bytes, or roughly 200k, to hold just the
> > strings, where we currently use only 200 (but repeatedly overwrite
> > them). That amount probably has a big impact on the cache hit ratio:
> > at worst, the first messages will be evicted from cache by the later
> > messages. And as a batch is processed from beginning to end, each
> > message will evict some strings that we will soon need. The end
> > result could be really bad. For smaller batch sizes, of course, that
> > would not be as much of a problem.
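The memory trade-off in that example is easy to quantify. A quick sketch (the cache sizes are illustrative assumptions, not measurements of any particular machine):

```python
# Per-batch buffer size for pre-generating all output strings, as in
# the example above: a batch of 1,000 messages, 200 bytes per string.
batch_size = 1_000
bytes_per_string = 200
buffer_bytes = batch_size * bytes_per_string
print(buffer_bytes)  # 200000 -- roughly 200k, vs. 200 bytes today

# Compared with typical (illustrative) cache sizes, such a buffer no
# longer fits in a 32 KiB L1 data cache, but does fit in a 256 KiB L2.
l1d, l2 = 32 * 1024, 256 * 1024
print(buffer_bytes > l1d, buffer_bytes <= l2)  # True True
```

This also makes concrete why smaller batch sizes soften the problem: halving the batch halves the buffer.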
> 
> I'm not sure how much of a problem this would be. it's not like you
> will be repeatedly accessing the same strings, it's just that by the
> time you get to the end of the string the beginning may no longer be
> in the cache.
> 
> this is inefficient from the cache point of view, but it may still be
> a win overall if there is a high transaction overhead in the output.
> In addition, depending on the output type you may not end up having to
> read the data back into the cache at all (for example, the system may
> be able to DMA the data directly to the output device, network card or
> disk controller)
> 
> the number of messages to try and batch is already tunable; if it's
> set low you use small strings which fit in cache, if large you don't
> fit in cache, but may save elsewhere.
> 
> > The other (partial) solution that immediately came to my mind was to
> > check / make it configurable whether we really *need* (or should
> > use) transactions. If not, there is no problem in interleaving
> > messages, and so we would not need to make sure the batch is
> > processed atomically; thus we could release the lock while
> > generating messages -- much like v4 does. HOWEVER, that also means
> > we have much more locking activity, and so lock contention begins to
> > become a concern again (somehow going in circles, here).
> >
> > Maybe it would be good to have both capabilities and let the
> > operator decide which one suits best. For starters, I will probably
> > just try to change the string generation phase to use a big buffer
> > and place that outside of the atomic transaction code. While I have
> > quite some concerns, outlined above, I find it interesting to see
> > the effect this has on the overall performance (when used with
> > reasonable batch sizes). This may be useful as a guideline for
> > further work.
> 
> we already allow the transaction size to be configured down to 1
> (effectively disabled), so there's no need for a tunable there.
> 
> But if the string generation is being done while the lock is held, it
> should definitely be moved outside of the lock.
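The restructuring David asks for (rendering outside the lock, only the atomic transaction inside it) could look roughly like this. This is a toy model, not rsyslog code; `render`, `process_batch`, and the template are hypothetical stand-ins:

```python
import threading

action_lock = threading.Lock()   # guards only the transactional output

def render(msg, template="%s\n"):
    # string generation: CPU-heavy template processing, no lock needed
    return template % msg

def process_batch(batch, output):
    # Phase 1: generate all output strings OUTSIDE the action lock, so
    # several worker threads can render their batches in parallel.
    rendered = [render(m) for m in batch]

    # Phase 2: only the atomic transaction holds the lock, so two
    # transactions cannot be mixed together.
    with action_lock:
        for line in rendered:       # begin transaction ... commit
            output.append(line)
```

The cost is exactly the buffer discussed earlier (`rendered` holds the whole batch at once); the gain is that only phase 2 serializes the workers.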
> 
> David Lang
> 
> > As usual, comments are greatly appreciated.
> >
> > Rainer
> >
> >> -----Original Message-----
> >> From: [email protected] [mailto:rsyslog-
> >> [email protected]] On Behalf Of Aaron Wiebe
> >> Sent: Monday, June 07, 2010 7:17 PM
> >> To: rsyslog-users
> >> Subject: Re: [rsyslog] discussion request: performance enhancement
> >> for imtcp
> >>
> >> Greetings,
> >>
> >> >> hmm, this could be locking overhead as well. One thing that you
> >> >> did early in v5 (I don't think it made it into v4) was to allow
> >> >> the UDP receiver to insert multiple messages into the queue at
> >> >> once. That made a huge difference.
> >> >
> >> > No, I think that was something I did to both versions. At some
> >> > time, I did optimizations to both v4 and v5, things like reducing
> >> > copies, reducing malloc calls and so on. I am pretty sure
> >> > submission batching was among them.
> >>
> >> I agree with David actually.  While multiple tcp threads on the
> >> input side certainly would be helpful, I believe the locking
> >> overhead is likely the real culprit behind the inability to fully
> >> utilize a multi-core machine with a single instance of rsyslog.  In
> >> my experience, while the input thread was certainly relatively
> >> busy, the thread itself wasn't hitting a cpu bottleneck.  Reducing
> >> some of the latencies around queuing and context switching is
> >> probably the best place to spend time if the goal is improved
> >> performance.  The earlier investigations into lockless queues
> >> combined with some batching may help to address these.  As it
> >> stands, I don't regularly see specific threads hitting cpu
> >> bottlenecks (assuming top -H is accurate).
> >>
> >> Also, if that is the problem (queues and context switching), adding
> >> further division of work into imtcp may actually make the problem
> >> worse.  That said, I'm not against reducing possible bottlenecks to
> >> get into the 1-10 gig input levels (at which this would probably
> >> become an issue) - but I think the queues should be more closely
> >> examined first.
> >>
> >> -Aaron
> >> _______________________________________________
> >> rsyslog mailing list
> >> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >> http://www.rsyslog.com
