> -----Original Message-----
> From: [email protected] [mailto:rsyslog-[email protected]] On Behalf Of [email protected]
> Sent: Monday, June 07, 2010 6:11 PM
> To: rsyslog-users
> Subject: Re: [rsyslog] discussion request: performance enhancement for imtcp
>
> On Mon, 7 Jun 2010, Rainer Gerhards wrote:
>
> >> -----Original Message-----
> >> From: [email protected] [mailto:rsyslog-[email protected]] On Behalf Of [email protected]
> >
> >> I'm surprised to see this as a problem (especially as my experience has
> >> been that the bottlenecks are on the output side, not the input side)
> >>
> >> the data is serialized as it arrives over the wire (at least if you have a
> >> single ethernet port in use), and with epoll I would expect a single
> >> thread to have no problem pulling the data from the network stack and
> >> putting it somewhere.
> >>
> >
> > At least this is a problem I got from some high performance sites. They had
> > in common that the actual rule processing was very, very easy, like a *.*
> > filter and just write-to-file actions. These are *extremely fast* (if you
> > disagree, please do so on list, I would be very interested in that).
>
> in my experience (with v5 and UDP messages) the thread that receives the
> messages is ~20% of the cpu utilization of the thread that writes the
> messages, even with a simple ruleset (mine is typically *.*
> /var/log/messages on the central boxes as well)
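[For readers following the thread: the single-threaded epoll receive pattern discussed above can be sketched as follows. This is Python for brevity (rsyslog itself is C) and it is not the actual imtcp code, just an illustration of the idea: one thread waits on an epoll set and drains whatever the kernel has buffered for each ready descriptor. A socketpair stands in for a TCP connection so the example is self-contained (Linux only).]

```python
# Sketch of a single-threaded epoll receive loop (illustrative, not rsyslog code).
import select
import socket

def drain_ready(ep, conns, max_events=64, timeout=1.0):
    """Wait for readable fds and read everything available on each one."""
    chunks = []
    for fd, events in ep.poll(timeout, max_events):
        if events & select.EPOLLIN:
            # over TCP, one recv() may return many syslog messages at once
            chunks.append(conns[fd].recv(64 * 1024))
    return chunks

a, b = socket.socketpair()          # stands in for an accepted TCP connection
ep = select.epoll()
ep.register(a.fileno(), select.EPOLLIN)
conns = {a.fileno(): a}

b.sendall(b"<13>msg one\n<13>msg two\n")   # two messages, one stream write
data = b"".join(drain_ready(ep, conns))
print(data.count(b"\n"))                   # -> 2: both messages in one read
a.close(); b.close(); ep.close()
```

Note how both messages arrive through a single `recv()` call, which is exactly why the input side is expected to be cheap relative to the output side.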
that's interesting. I'll try to see if I can reproduce similar behavior. Do you have a chance to do a quick test with TCP in your lab? The input should be even further down (far lower, I'd expect).

> > BUT I need to mention that this was in v4, unfortunately not in v5. That
> > meant that the event handling was done by select() and with select's bad
> > performance for larger connection sets, that may be the culprit. HOWEVER, I
> > got from the reports that the CPUs were NOT saturated (and the message rate
> > lower) when listeners were run inside a single instance, but CPUs got
> > saturated (and the message rate higher) when a couple of rsyslog instances
> > ran. The only explanation I have for this is that the single instance
> > actually did not manage to pull the data from the operating system buffers.
>
> hmm, this could be locking overhead as well. One thing that you did early
> in v5 (I don't think it made it into v4) was to allow the UDP receiver to
> insert multiple messages into the queue at once. That made a huge
> difference.

No, I think that was something I did to both versions. At some point, I did optimizations to both v4 and v5, things like reducing copies, reducing malloc calls and so on. I am pretty sure submission batching was among them.

> > I also failed to ask if the machines had multiple NICs, which would
> > somewhat explain the effect seen.
> >
> > I myself unfortunately seem to have an insufficient lab environment to see
> > this effect, which makes it a bit hard for me to judge.
>
> see if they can do a strace of the various threads for a few seconds under
> high load.
>
> also, can they get you a tcpdump for a few seconds so you can see the
> number of sources, connections, etc?

I'll try to obtain more info.

> >> I think that more research needs to be done on what is eating up the time
> >> in your test cases.
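[The batched queue submission mentioned above, inserting multiple messages under one lock acquisition instead of one per message, can be sketched like this. Python for illustration; the real code is C, and the class and counter names here are mine, not rsyslog's.]

```python
# Sketch of batched queue submission (illustrative, not rsyslog's actual queue).
from collections import deque
from threading import Lock

class MainQueue:
    def __init__(self):
        self._lock = Lock()
        self._items = deque()
        self.lock_cycles = 0           # instrumentation for this example only

    def enqueue_one(self, msg):
        with self._lock:               # one lock acquisition per message
            self.lock_cycles += 1
            self._items.append(msg)

    def enqueue_batch(self, msgs):
        with self._lock:               # one lock acquisition for the whole batch
            self.lock_cycles += 1
            self._items.extend(msgs)

q = MainQueue()
batch = [f"<13>msg {i}" for i in range(100)]
q.enqueue_batch(batch)                 # 100 messages, a single lock cycle
print(len(q._items), q.lock_cycles)    # -> 100 1
```

With one acquisition per batch, lock contention between the receiver thread and the queue workers shrinks roughly with the batch size, which is consistent with the "huge difference" reported above.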
> >>
> >> If it's DNS lookups, they can be disabled (and/or a
> >> name cache can be created as we have discussed before)
> >
> > That was, for the cases I have seen, hardly an issue -- many messages were sent
> > over each connection and for tcp the DNS lookup is only done during
> > connection setup. Still it is a good reminder to finish that part of the code
> > (full dns cache).
>
> good point
>
> >> It may be that the parsing that's being done is what's taking the time
> >> here, so I would consider something like the following
> >>
> >> one thread to pull the data from the wire and dispatch it to N worker
> >> threads that would parse the message and put the result into the main
> >> queue.
> >
> > a) for v5 and some of v4, parsing is no longer done on the input side (and
> > thus runs via a worker pool, the main queue worker pool to be precise)
> >
> > b) this architecture requires more context switches, something I would really
> > like to avoid. I guess it would even lead to far worse performance in the
> > single listener case.
> >>
> >>
> >> even late last year with UDP messages I was able to saturate a Gig-E
> >> network with packets and receive them with <25% of a single cpu. I would
> >> not expect that TCP would have noticeably more overhead.
> >
> > I fully agree, definitely far less (just think that I do need to do one API
> > call for each message with UDP, while I can receive hundreds of messages
> > with a single API call in the case of TCP -- depending on receive buffer and
> > message size).
>
> so where is the time being spent?
>
> high performance http servers serving static content can do hundreds of
> thousands of connections in a single thread and saturate gig-E while doing
> so, this is more processing than rsyslog should have to do, so I am having
> trouble believing that you need to go to multiple threads to handle the
> input side of things.

That's a very convincing argument.
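[The "full dns cache" mentioned earlier in the thread could look roughly like the sketch below: resolve a peer address once and serve subsequent lookups from a TTL-bounded cache. Python for illustration; the class, TTL handling, and resolver injection are my assumptions, not rsyslog's actual design.]

```python
# Sketch of a reverse-DNS name cache (illustrative design, not rsyslog code).
import socket
import time

class NameCache:
    def __init__(self, ttl=300.0, resolver=socket.getfqdn):
        self._ttl = ttl
        self._resolver = resolver   # injectable, so the example needs no real DNS
        self._cache = {}            # ip -> (name, expiry timestamp)
        self.misses = 0

    def lookup(self, ip):
        now = time.monotonic()
        hit = self._cache.get(ip)
        if hit and hit[1] > now:
            return hit[0]           # fresh entry: no resolver call
        self.misses += 1
        name = self._resolver(ip)
        self._cache[ip] = (name, now + self._ttl)
        return name

calls = []
cache = NameCache(resolver=lambda ip: calls.append(ip) or f"host-{ip}")
assert cache.lookup("192.0.2.1") == "host-192.0.2.1"   # first lookup resolves
assert cache.lookup("192.0.2.1") == "host-192.0.2.1"   # second is a cache hit
print(len(calls))  # -> 1: the resolver ran only once
```

As noted above, for TCP this matters mainly at connection setup, but for UDP (one datagram per message) a cache like this avoids a resolver call per message.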
I'll go back in my cycle to find evidence of the problem and then see if I am addressing the proper culprit. So it looks like some other optimization is going to come first ;)

Thanks,
Rainer
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

