Date: Mon, 2 Feb 2015 11:21:29 +0100
From: Rainer Gerhards <[email protected]>
To: David Lang <[email protected]>
Cc: rsyslog-users <[email protected]>
Subject: Re: [rsyslog] RELP performance was: Re: [RFC: Ingestion Relay]
End-to-end reliable 'at-least-once' message delivery at large scale
Answering this at David's request. I don't want to dig into optimizing
RELP, though, as there is so much else going on that I can't do any work on
it in any case.
Facts inline below.
2015-01-25 3:43 GMT+01:00 David Lang <[email protected]>:
On Sat, 24 Jan 2015, David Lang wrote:
- RELP or not
* My benchmarks on a 1G 20-core box achieved a maximum of 27 MB/s
(uncompressed) with RELP with multiple producers. Compare that with ptcp:
98 MB/s uncompressed, 220 MB/s with stream compression. So efficiency is
one of the points of contention.
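For reference, the receiver side of such a comparison might look roughly
like the following (ports are illustrative and any tuning is omitted; this
is not the actual benchmark config):

```
# RELP receiver (imrelp)
module(load="imrelp")
input(type="imrelp" port="2514")

# plain-TCP receiver (imptcp, the "ptcp" input)
module(load="imptcp")
input(type="imptcp" port="514")
```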
I think I know what's going on here. If I'm correct, it will require some
"interesting" changes to how the internal queue is used (although similar
changes may happen as part of the pull work that Rainer is working on)
Rainer, please check my logic.
the worker thread locks the queue and marks a batch of messages as being
worked on.
The worker thread gets to the RELP action and starts sending these
messages, and handling acks as they arrive.
The worker thread cannot consider the RELP action a success and go on to
the next part of the config, or the next batch of messages, until all the
messages in this batch have been acked (or some have failed).
That's not the case. It's RELP that will ensure that all messages are
processed. Once they are handed over to librelp, and the call returns, the
worker can be (sufficiently) sure that librelp will deliver them. So the
queue processing will be pushed back only when librelp blocks. It all
depends on relp window size.
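The point that the caller is pushed back only when the RELP window is full
can be sketched in a toy model (this is not librelp code; the class and all
names are made up for illustration):

```python
# Toy model: a sender may have at most `window` messages un-acked at
# any time. Handing a message over only blocks when the window is
# full -- which is why the queue worker is pushed back only when
# librelp blocks, and why a tiny window throttles throughput.
from collections import deque

class WindowedSender:
    def __init__(self, window):
        self.window = window
        self.unacked = deque()   # messages sent but not yet acked
        self.blocked_sends = 0   # times the caller had to wait

    def send(self, msg):
        if len(self.unacked) >= self.window:
            self.blocked_sends += 1
            self.ack()           # simulate waiting for one ack
        self.unacked.append(msg)

    def ack(self):
        if self.unacked:
            self.unacked.popleft()

small = WindowedSender(window=2)
large = WindowedSender(window=100)
for i in range(100):
    small.send(i)
    large.send(i)
print(small.blocked_sends)  # 98 -- the caller stalls on almost every send
print(large.blocked_sends)  # 0  -- the caller never waits
```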
Note that we identified that it is possible to lose up to RELP window size
messages if rsyslog goes down in a failure situation. We already talked
about cures (callbacks to save the messages in such cases), but nobody so
far has had time to implement this (nor has anyone really complained hard
about it).
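A minimal sketch of that cure, assuming a failure callback gets a chance to
run: persist the at-most-window-size set of in-flight messages so they can
be replayed after a restart. Names like on_failure and replay are
hypothetical, not an existing rsyslog or librelp API:

```python
# Hypothetical sketch: without the save callback, everything still in
# the un-acked window is lost on a crash; with it, the window contents
# survive for replay.
import json
import os
import tempfile

class ReliableSender:
    def __init__(self, window, statefile):
        self.window = window
        self.statefile = statefile
        self.unacked = []        # in-flight messages, bounded by window

    def send(self, msg):
        assert len(self.unacked) < self.window
        self.unacked.append(msg)

    def ack(self, msg):
        self.unacked.remove(msg)

    def on_failure(self):
        # persist the un-acked window (e.g. from a crash handler)
        with open(self.statefile, "w") as f:
            json.dump(self.unacked, f)

    def replay(self):
        # after restart: messages to re-send for at-least-once delivery
        with open(self.statefile) as f:
            return json.load(f)

statefile = os.path.join(tempfile.gettempdir(), "relp.unacked")
s = ReliableSender(window=128, statefile=statefile)
for m in ["msg1", "msg2", "msg3"]:
    s.send(m)
s.ack("msg1")
s.on_failure()
print(s.replay())  # ['msg2', 'msg3'] survive the simulated crash
```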
This means that the window of unacked messages cannot be larger than the
size of the batch of messages being processed. Even if there are no other
actions that the worker is doing on this batch of messages, this causes the
output of messages to 'throb', with a bunch of messages being sent, and
then a pause when all the messages of this batch have been sent, but not
all acks have been received, then the next batch is processed and the cycle
repeats. If there are other actions that take place from the same queue,
the throbbing is even worse, because all those other actions run in the
gaps between the RELP module's sending bursts.
no, batch begin is just sent as a hint to librelp (which does some
optimization based on it that I don't remember off the top of my head -- I
think it is related to filling up buffers).
If there are multiple RELP actions, it gets even worse, because we can't
start sending on the second RELP action until the first RELP action says
the message is delivered (just in case there is logic in place to do
something different if the delivery fails on the first action)
There is no relationship between the two.
I guess the performance degradation relative to TCP stems from the fact
that TCP has a much larger window (in terms of messages). Try setting the
RELP window to 100,000 messages and I guess the performance will be much
closer -- but that of course also means you may lose that many messages.
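If anyone wants to try that experiment, a config sketch might look like
this (assuming your omrelp/librelp version exposes a window-size setting;
the parameter name and the target host are illustrative, so check your
version's documentation first):

```
# A larger window means higher throughput, but up to that many
# messages are at risk if the sender goes down uncleanly.
action(type="omrelp" target="central.example.com" port="2514"
       windowSize="100000")
```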
Rainer
First question: is my logic on what's going on plausible?
If so, how can this be fixed?
One way I see to fix this, a worker thread would have to be able to
'complete' a batch and start working on the next batch without marking all
the messages in the first batch as delivered (because we are waiting for
acks that may never come; they aren't delivered yet). And if there are
multiple RELP actions, it can't mark the message as being delivered and
able to be removed from the queue until all of the actions have acked the
message.
The other possible way of handling this is to have the RELP output module
maintain its own internal queue, and let the worker thread delete the
message from the action queue as soon as the RELP action starts working on
it; if the RELP action can't deliver it, the message would have to be
re-added to the action queue from the internal queue. But at that point, I
don't see how to sanely handle the "do this if the prior action failed"
type of logic, and the log message would end up getting re-delivered to all
the other outputs.
Whatever is being done for the pull output module may be able to help
here, because the pull output has a similar problem, you don't want to hold
up delivery of other messages just because the clients haven't requested
them yet. But for the pull output, rsyslog can consider the message
delivered as soon as it's handed off to the pull module.
David Lang