> -----Original Message-----
> From: [email protected] [mailto:rsyslog-[email protected]] On Behalf Of [email protected]
> Sent: Monday, June 07, 2010 6:11 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] discussion request: performance enhancement for imtcp
>
> On Mon, 7 Jun 2010, Rainer Gerhards wrote:
>
> >> -----Original Message-----
> >> From: [email protected] [mailto:rsyslog-[email protected]] On Behalf Of [email protected]
> >
> >> I'm surprised to see this as a problem (especially as my experience has
> >> been that the bottlenecks are on the output side, not the input side)
> >>
> >> the data is serialized as it arrives over the wire (at least if you have a
> >> single ethernet port in use), and with epoll I would expect a single
> >> thread to have no problem pulling the data from the network stack and
> >> putting it somewhere.
> >>
> >
> > At least this is a problem I got from some high performance sites. They had
> > in common that the actual rule processing was very, very easy, like a *.*
> > filter and just write-to-file actions. These are *extremely fast* (if you
> > disagree, please do so on list, I would be very interested in that).
>
> in my experience (with v5 and UDP messages) the thread that receives the
> messages is ~20% of the cpu utilization of the thread that writes the
> messages, even with a simple ruleset (mine is typically *.*
> /var/log/messages on the central boxes as well)
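[For readers following the thread: the single-threaded epoll receive pattern discussed above can be sketched as follows. This is Python for brevity (rsyslog itself is C) and it is not the actual imtcp code, just an illustration of the idea: one thread waits on an epoll set and drains whatever the kernel has buffered for each ready descriptor. A socketpair stands in for a TCP connection so the example is self-contained (Linux only).]

```python
# Sketch of a single-threaded epoll receive loop (illustrative, not rsyslog code).
import select
import socket

def drain_ready(ep, conns, max_events=64, timeout=1.0):
    """Wait for readable fds and read everything available on each one."""
    chunks = []
    for fd, events in ep.poll(timeout, max_events):
        if events & select.EPOLLIN:
            # over TCP, one recv() may return many syslog messages at once
            chunks.append(conns[fd].recv(64 * 1024))
    return chunks

a, b = socket.socketpair()          # stands in for an accepted TCP connection
ep = select.epoll()
ep.register(a.fileno(), select.EPOLLIN)
conns = {a.fileno(): a}

b.sendall(b"<13>msg one\n<13>msg two\n")   # two messages, one stream write
data = b"".join(drain_ready(ep, conns))
print(data.count(b"\n"))                   # -> 2: both messages in one read
a.close(); b.close(); ep.close()
```

Note how both messages arrive through a single `recv()` call, which is exactly why the input side is expected to be cheap relative to the output side.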
that's interesting. I'll try to see if I can reproduce similar behavior. Do you have a chance to do a quick test with TCP in your lab? The input should be even further down (far lower, I'd expect).

> > BUT I need to mention that this was in v4, unfortunately not in v5. That
> > meant that the event handling was done by select() and with select's bad
> > performance for larger connection sets, that may be the culprit. HOWEVER, I
> > got from the reports that the CPUs were NOT saturated (and the message rate
> > lower) when listeners were run inside a single instance, but CPUs got
> > saturated (and the message rate higher) when a couple of rsyslog instances
> > ran. The only explanation I have for this is that the single instance
> > actually did not manage to pull the data from the operating system buffers.
>
> hmm, this could be locking overhead as well. One thing that you did early
> in v5 (I don't think it made it into v4) was to allow the UDP receiver to
> insert multiple messages into the queue at once. That made a huge
> difference.

No, I think that was something I did to both versions. At some point, I did optimizations to both v4 and v5, things like reducing copies, reducing malloc calls and so on. I am pretty sure submission batching was among them.

> > I also failed to ask if the machines had multiple NICs, which would
> > somewhat explain the effect seen.
> >
> > I myself unfortunately seem to have an insufficient lab environment to see
> > this effect, which makes it a bit hard for me to judge.
>
> see if they can do a strace of the various threads for a few seconds under
> high load.
>
> also, can they get you a tcpdump for a few seconds so you can see the
> number of sources, connections, etc?

I'll try to obtain more info.

> >> I think that more research needs to be done on what is eating up the time
> >> in your test cases.
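[The batched queue submission mentioned above, inserting multiple messages under one lock acquisition instead of one per message, can be sketched like this. Python for illustration; the real code is C, and the class and counter names here are mine, not rsyslog's.]

```python
# Sketch of batched queue submission (illustrative, not rsyslog's actual queue).
from collections import deque
from threading import Lock

class MainQueue:
    def __init__(self):
        self._lock = Lock()
        self._items = deque()
        self.lock_cycles = 0           # instrumentation for this example only

    def enqueue_one(self, msg):
        with self._lock:               # one lock acquisition per message
            self.lock_cycles += 1
            self._items.append(msg)

    def enqueue_batch(self, msgs):
        with self._lock:               # one lock acquisition for the whole batch
            self.lock_cycles += 1
            self._items.extend(msgs)

q = MainQueue()
batch = [f"<13>msg {i}" for i in range(100)]
q.enqueue_batch(batch)                 # 100 messages, a single lock cycle
print(len(q._items), q.lock_cycles)    # -> 100 1
```

With one acquisition per batch, lock contention between the receiver thread and the queue workers shrinks roughly with the batch size, which is consistent with the "huge difference" reported above.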
> >>
> >> If it's DNS lookups, they can be disabled (and/or a
> >> name cache can be created as we have discussed before)
> >
> > That was, for the cases I have seen, hardly an issue -- many messages were sent
> > over each connection and for tcp the DNS lookup is only done during
> > connection setup. Still it is a good reminder to finish that part of the code
> > (full dns cache).
>
> good point
>
> >> It may be that the parsing that's being done is what's taking the time
> >> here, so I would consider something like the following
> >>
> >> one thread to pull the data from the wire and dispatch it to N worker
> >> threads that would parse the message and put the result into the main
> >> queue.
> >
> > a) for v5 and some of v4, parsing is no longer done on the input side (and
> > thus runs via a worker pool, the main queue worker pool to be precise)
> >
> > b) this architecture requires more context switches, something I would really
> > like to avoid. I guess it would even lead to far worse performance in the
> > single listener case.
> >>
> >>
> >> even late last year with UDP messages I was able to saturate a Gig-E
> >> network with packets and receive them with <25% of a single cpu. I would
> >> not expect that TCP would have noticeably more overhead.
> >
> > I fully agree, definitely far less (just think that I do need to do one API
> > call for each message with UDP, while I can receive hundreds of messages
> > with a single API call in the case of TCP -- depending on receive buffer and
> > message size).
>
> so where is the time being spent?
>
> high performance http servers serving static content can do hundreds of
> thousands of connections in a single thread and saturate gig-E while doing
> so, this is more processing than rsyslog should have to do, so I am having
> trouble believing that you need to go to multiple threads to handle the
> input side of things.

That's a very convincing argument.
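[The "full dns cache" mentioned earlier in the thread could look roughly like the sketch below: resolve a peer address once and serve subsequent lookups from a TTL-bounded cache. Python for illustration; the class, TTL handling, and resolver injection are my assumptions, not rsyslog's actual design.]

```python
# Sketch of a reverse-DNS name cache (illustrative design, not rsyslog code).
import socket
import time

class NameCache:
    def __init__(self, ttl=300.0, resolver=socket.getfqdn):
        self._ttl = ttl
        self._resolver = resolver   # injectable, so the example needs no real DNS
        self._cache = {}            # ip -> (name, expiry timestamp)
        self.misses = 0

    def lookup(self, ip):
        now = time.monotonic()
        hit = self._cache.get(ip)
        if hit and hit[1] > now:
            return hit[0]           # fresh entry: no resolver call
        self.misses += 1
        name = self._resolver(ip)
        self._cache[ip] = (name, now + self._ttl)
        return name

calls = []
cache = NameCache(resolver=lambda ip: calls.append(ip) or f"host-{ip}")
assert cache.lookup("192.0.2.1") == "host-192.0.2.1"   # first lookup resolves
assert cache.lookup("192.0.2.1") == "host-192.0.2.1"   # second is a cache hit
print(len(calls))  # -> 1: the resolver ran only once
```

As noted above, for TCP this matters mainly at connection setup, but for UDP (one datagram per message) a cache like this avoids a resolver call per message.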
I'll go back in my cycle to find evidence of the problem and then see if I am addressing the proper culprit. So it looks like some other optimization is going to come first ;)

Thanks,
Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

