On Mon, 7 Jun 2010, Rainer Gerhards wrote:

>> -----Original Message-----
>> From: [email protected] [mailto:rsyslog-
>> [email protected]] On Behalf Of [email protected]
>
>> I'm surprised to see this as a problem (especially as my experience has
>> been that the bottlenecks are on the output side, not the input side).
>>
>> The data is serialized as it arrives over the wire (at least if you have
>> a single ethernet port in use), and with epoll I would expect a single
>> thread to have no problem pulling the data from the network stack and
>> putting it somewhere.
>
> At least this is a problem I got from some high performance sites. They
> had in common that the actual rule processing was very, very easy, like a
> *.* filter and just write-to-file actions. These are *extremely fast* (if
> you disagree, please do so on list, I would be very interested in that).
In my experience (with v5 and UDP messages) the thread that receives the
messages uses ~20% of the CPU of the thread that writes the messages, even
with a simple ruleset (mine is typically *.* /var/log/messages on the
central boxes as well).

> BUT I need to mention that this was in v4, unfortunately not in v5. That
> meant that the event handling was done by select() and with select's bad
> performance for larger connection sets, that may be the culprit. HOWEVER,
> I got from the reports that the CPUs were NOT saturated (and the message
> rate lower) when listeners were run inside a single instance, but CPUs
> got saturated (and the message rate higher) when a couple of rsyslog
> instances ran. The only explanation I have for this is that the single
> instance actually did not manage to pull the data from the operating
> system buffers.

Hmm, this could be locking overhead as well. One thing that you did early
in v5 (I don't think it made it into v4) was to allow the UDP receiver to
insert multiple messages into the queue at once. That made a huge
difference.

> I also failed to ask if the machines had multiple NICs, which would
> further explain the effect seen.
>
> I myself unfortunately seem to have an insufficient lab environment to
> see this effect, which makes it a bit hard for me to judge.

See if they can get an strace of the various threads for a few seconds
under high load. Also, can they get you a tcpdump for a few seconds so you
can see the number of sources, connections, etc.?

>> I think that more research needs to be done on what is eating up the
>> time in your test cases.
>>
>> If it's DNS lookups, they can be disabled (and/or a name cache can be
>> created as we have discussed before).
>
> That was hardly an issue for the cases I have seen -- many messages were
> sent over each connection, and for TCP the DNS lookup is only done during
> connection setup. Still, it is a good reminder to finish that part of the
> code (full DNS cache).
Good point.

>> It may be that the parsing that's being done is what's taking the time
>> here, so I would consider something like the following:
>>
>> one thread to pull the data from the wire and dispatch it to N worker
>> threads that would parse the message and put the result into the main
>> queue.
>
> a) for v5 and some of v4, parsing is no longer done on the input side
> (and thus runs via a worker pool, the main queue worker pool to be
> precise)
>
> b) this architecture requires more context switches, something I would
> really like to avoid. I guess it would even lead to far worse performance
> in the single listener case.

>> even late last year with UDP messages I was able to saturate a Gig-E
>> network with packets and receive them with <25% of a single CPU. I
>> would not expect that TCP would have noticeably more overhead.
>
> I fully agree, definitely far less (just think that I need to do one API
> call for each message with UDP, while I can receive hundreds of messages
> with a single API call in the case of TCP -- depending on receive buffer
> and message size).

So where is the time being spent? High-performance HTTP servers serving
static content can handle hundreds of thousands of connections in a single
thread and saturate Gig-E while doing so. That is more processing than
rsyslog should have to do, so I am having trouble believing that you need
to go to multiple threads to handle the input side of things.

David Lang

_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com

