On 05/16/2018 10:08 PM, David Lang wrote:
On Wed, 16 May 2018, Rich Megginson wrote:
On 05/16/2018 05:58 PM, David Lang wrote:
there's no need to add this extra complexity (multiple rulesets and
queues)
What should be happening (on any output module) is:
submit a batch.
If rejected with a soft error, retry/suspend the output
retry of the entire batch? see below
if batch-size=1 and a hard error, send to errorfile
if rejected with a hard error, resubmit half of the batch
repeat
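[Editor's note: for illustration, a minimal sketch of the loop described
above, assuming hypothetical submit_batch(), suspend_output(), and
write_error_file() helpers; none of these are real rsyslog APIs.]

    # Sketch of the submit/bisect loop described above. submit_batch(),
    # suspend_output(), and write_error_file() are hypothetical helpers.
    def process(batch):
        pending = [batch]
        while pending:
            chunk = pending.pop()
            result = submit_batch(chunk)   # "ok", "soft-error", or "hard-error"
            if result == "ok":
                continue
            if result == "soft-error":
                suspend_output()               # wait, then retry the same chunk
                pending.append(chunk)
            elif result == "hard-error":
                if len(chunk) == 1:
                    write_error_file(chunk[0])  # batch-size=1: give the record up
                else:
                    mid = len(chunk) // 2       # resubmit each half separately
                    pending.append(chunk[:mid])
                    pending.append(chunk[mid:])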
But what if 90% of the batch was successfully added? Then you are
needlessly resubmitting many of the records in the batch.
when submitting batches, you get a success/fail for the batch as a
whole (for 99% of things that actually allow you to insert in batches),
For Elasticsearch - yes, there is a top-level "errors" field in the
response with a boolean value. false means all records in the batch were
successfully processed. true means _at least one_ record in the batch
was not processed successfully. For example, in a batch of 10000
records, you will get a response of "errors": true even if 9999 of those
records were successfully processed.
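[Editor's note: for illustration, a trimmed bulk response of that shape,
written as a Python literal; the index name, IDs, and error details are
made up.]

    # Trimmed example of a bulk response where one item failed; the
    # index name, IDs, and error details are made up.
    bulk_response = {
        "took": 30,
        "errors": True,   # at least one item in the batch failed
        "items": [
            {"index": {"_index": "logs", "_id": "1", "status": 201}},
            {"index": {"_index": "logs", "_id": "2", "status": 429,
                       "error": {"type": "es_rejected_execution_exception",
                                 "reason": "rejected execution of bulk item"}}},
        ],
    }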
so you don't know what message failed.
You do know exactly which record failed and in most cases what the error
was. Here is an example from the fluent-plugin-elasticsearch unit test:
https://github.com/uken/fluent-plugin-elasticsearch/blob/master/test/plugin/test_elasticsearch_error_handler.rb#L88
This is what the response looks like coming from Elasticsearch. You get
a separate response item for every record submitted in the bulk
request. In addition, you are guaranteed that the order of the items in
the response is exactly the same as the order of the items submitted in
the bulk request, so that you can exactly correlate the request object
with the response.
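[Editor's note: a sketch of how a client could use that ordering
guarantee to split a bulk request into succeeded and failed records;
plain Python for illustration, not omelasticsearch code.]

    # Sketch: walk the request actions and response items in parallel,
    # relying on the guarantee that the response preserves request order.
    def split_by_status(actions, response):
        succeeded, failed = [], []
        for action, item in zip(actions, response["items"]):
            result = next(iter(item.values()))  # keyed by op type, e.g. "index"
            if result.get("status", 500) < 300:
                succeeded.append(action)
            else:
                failed.append((action, result.get("error")))
        return succeeded, failed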
This is a database transaction (again, in most cases),
Not in Elasticsearch at the bulk index level. Probably at the very low
level where Lucene hits the disk.
so if a batch fails, all you can do is bisect to figure out which
message failed. If the endpoint is inserting some of the messages from
a batch that fails, that's usually a bad thing.
now, if ES batch mode isn't an ACID transaction and it accepts some
messages and then tells you which ones failed,
It does
then you can mark the ones accepted as done and just retry the ones
that fail.
That's what I'm proposing.
But there's still no need for a separate ruleset and queue. In
rsyslog, if an output cannot accept a message and there's reason to
think that it will be able to in the future, you suspend that output and
try again later. If you have reason to believe that the message will
never be deliverable, then you need to fail the message or you will be
stuck forever. This is what the error output was made for.
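[Editor's note: as a sketch of that soft/hard split for the
Elasticsearch case, classifying by per-item HTTP status; this mapping is
an assumption for illustration, not what omelasticsearch actually does.]

    # Hypothetical soft/hard classification by HTTP status per bulk item.
    def classify(status):
        if status == 429:            # queue full: will likely succeed later
            return "suspend-and-retry"
        if 400 <= status < 500:      # e.g. mapping conflict: will never succeed
            return "error-output"
        return "suspend-and-retry"   # 5xx: assume transient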
So how would that work on a per-record basis?
Would this be something different than using MsgConstruct -> set fields
in msg from original request -> ratelimitAddMsg for each record to resubmit?
If using the "index" (default) bulk type, this causes duplicate
records to be added.
If using the "create" type (and you have assigned a unique _id), you
will get back many 409 Duplicate errors.
This causes problems - we know because this is how the fluentd plugin
used to work, which is why we had to change it.
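[Editor's note: for illustration, the two bulk action types as request
action lines, written as Python literals; the index name and _id are
made up.]

    # "index" does not care about prior state, so a resubmitted record
    # is silently duplicated (no caller-assigned _id).
    index_action = {"index": {"_index": "logs"}}
    # "create" with a caller-assigned unique _id fails with 409 if the
    # record already exists, so resubmissions surface as errors instead.
    create_action = {"create": {"_index": "logs", "_id": "abc123"}}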
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
"Bulk Rejections"
"It is much better to handle queuing in your application by
gracefully handling the back pressure from a full queue. When you
receive bulk rejections, you should take these steps:
Pause the import thread for 3–5 seconds.
Extract the rejected actions from the bulk response, since it is
probable that many of the actions were successful. The bulk response
will tell you which succeeded and which were rejected.
Send a new bulk request with just the rejected actions.
Repeat from step 1 if rejections are encountered again.
Using this procedure, your code naturally adapts to the load of your
cluster and naturally backs off.
"
Does it really accept some and reject some in a random manner? Or is it
a matter of accepting the first X and rejecting any after that point?
The second would be easier to deal with.
It appears to be random. So you may get a failure from the first record
in the batch and the last record in the batch, and success for the
others. Or vice versa. There appear to be many, many factors in the
tuning, hardware, network, etc. that come into play.
There isn't an easy way to deal with this :P
Batch mode was created to more efficiently process messages that are
inserted into databases; we then found that the reduced queue congestion
was a significant advantage in itself.
But unless you have a queue just for the ES action,
That's what we had to do for the fluentd case - we have a separate "ES
retry queue". One of the tricky parts is that there may be multiple
outputs - you may want to send each log record to Elasticsearch _and_ a
message bus _and_ a remote rsyslog forwarder. But you only want to retry
sending to Elasticsearch to avoid duplication in the other outputs.
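[Editor's note: a sketch of that fan-out constraint, with hypothetical
send_to_*() helpers; only the Elasticsearch failures go onto the
ES-only retry queue, so the other outputs never see a record twice.]

    # Hypothetical fan-out: each record goes to all three outputs, but
    # only Elasticsearch failures are requeued, and only for ES.
    def fan_out(record, es_retry_queue):
        send_to_message_bus(record)
        send_to_remote_rsyslog(record)
        if not send_to_elasticsearch(record):
            es_retry_queue.append(record)  # retried against Elasticsearch only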
doing queue manipulation isn't possible, all you can do is succeed or
fail, and if you fail, the retry logic will kick in.
Rainer is going to need to comment on this.
David Lang
all that should be needed is to add tests into omelasticsearch to
detect the soft errors and turn them into retries (or suspend the
output as appropriate)
David Lang