Re: [rsyslog] omelasticsearch - failed operation handling

Rich Megginson via rsyslog Thu, 17 May 2018 07:32:00 -0700

On 05/16/2018 10:08 PM, David Lang wrote:

On Wed, 16 May 2018, Rich Megginson wrote:
On 05/16/2018 05:58 PM, David Lang wrote:
there's no need to add this extra complexity (multiple rulesets and queues)
What should be happening (on any output module) is:

submit a batch.
   If rejected with a soft error, retry/suspend the output
retry of the entire batch?  see below
if batch-size=1 and a hard error, send to errorfile
   if rejected with a hard error resubmit half of the batch
But what if 90% of the batch was successfully added? Then you are needlessly resubmitting many of the records in the batch.
when submitting batches, you get a success/fail for the batch as a whole (for 99% of things that actually allow you to insert in batches),

For Elasticsearch - yes, there is a top level "errors" field in the response with a boolean value, true or false. false means all records in the batch were successfully processed. true means _at least one_ record in the batch was not processed successfully. For example, in a batch of 10000 records, you will get a response of "errors": true even if 9999 of those records were successfully processed.

so you don't know what message failed.

You do know exactly which record failed and in most cases what the error was. Here is an example from the fluent-plugin-elasticsearch unit test:
https://github.com/uken/fluent-plugin-elasticsearch/blob/master/test/plugin/test_elasticsearch_error_handler.rb#L88
This is what the response looks like coming from Elasticsearch. You get a separate response item for every record submitted in the bulk request. In addition, you are guaranteed that the order of the items in the response is exactly the same as the order of the items submitted in the bulk request, so that you can exactly correlate the request object with the response.
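A minimal sketch of that correlation, assuming the bulk response has already been parsed into a Python dict with the top level "errors" flag and an "items" list in request order. The function name failed_items and the "status >= 300 means failed" cutoff are illustrative assumptions, not taken from omelasticsearch or the fluentd plugin.

```python
def failed_items(actions, bulk_response):
    """Pair each submitted action with its response item; return the failures.

    `actions` is the list of records in the order they were submitted.
    `bulk_response` is the parsed JSON body of the bulk reply.
    """
    # "errors": false means every record in the batch succeeded.
    if not bulk_response.get("errors"):
        return []

    failures = []
    # Items come back in the same order as the request, so zip() lines them up.
    for action, item in zip(actions, bulk_response["items"]):
        # Each item is a one-key dict keyed by operation type,
        # e.g. {"create": {"status": 429, "error": {...}}}.
        (op_type, result), = item.items()
        status = result.get("status", 0)
        if status >= 300:  # assumption: 2xx is success, anything else failed
            failures.append((action, op_type, status, result.get("error")))
    return failures
```

Because the pairing is purely positional, the caller has to keep the submitted batch around, in order, until the response has been processed.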

This is a database transaction (again, in most cases),

Not in Elasticsearch at the bulk index level. Probably at the very low level where Lucene hits the disk.

so if a batch fails, all you can do is bisect to figure out what message fails. If the endpoint is inserting some of the messages from a batch that fails, that's usually a bad thing.
now, if ES batch mode isn't an ACID transaction and it accepts some messages and then tells you which ones failed,


It does

then you can mark the ones accepted as done and just retry the ones that fail.


That's what I'm proposing.

But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.
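A rough sketch of how that suspend-versus-fail decision might look if it were made per record from each item's HTTP status in the bulk response. The status groupings below are an assumption for illustration only; neither rsyslog nor this thread defines them, and classify and its return labels are hypothetical names.

```python
def classify(status):
    """Map a per-item bulk status code to a handling decision (illustrative)."""
    if 200 <= status < 300:
        return "done"        # record was accepted
    if status == 409:
        return "duplicate"   # with "create" and a unique _id: already stored
    if status == 429 or status >= 500:
        return "retry"       # soft error: the endpoint may accept it later
    return "error_file"      # hard error: retrying will never succeed


# Example: tally decisions for the statuses of one bulk response.
statuses = [201, 429, 400, 409, 503]
decisions = [classify(s) for s in statuses]
```

Only the "retry" bucket gives a reason to suspend the output; the "error_file" bucket corresponds to messages that have to be failed so the queue does not get stuck.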


So how would that work on a per-record basis?

Would this be something different than using MsgConstruct -> set fields in msg from original request -> ratelimitAddMsg for each record to resubmit?

If using the "index" (default) bulk type, this causes duplicate records to be added. If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors. This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
"Bulk Rejections"
"It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:

    1. Pause the import thread for 3–5 seconds.
    2. Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
    3. Send a new bulk request with just the rejected actions.
    4. Repeat from step 1 if rejections are encountered again.

Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off.
"
Does it really accept some and reject some in a random manner? or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.

It appears to be random. So you may get a failure from the first record in the batch and the last record in the batch, and success for the others. Or vice versa. There appear to be many, many factors in the tuning, hardware, network, etc. that come into play.
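Since rejections can land anywhere in the batch, the steps quoted from the Elasticsearch guide come down to filtering on each item's status rather than cutting the batch at some offset. Here is a minimal sketch under stated assumptions: send_bulk is a hypothetical callable that submits a list of actions and returns the parsed bulk response, a rejection is taken to be a per-item status of 429, and the pause and attempt limit are arbitrary. This is not how omelasticsearch works today.

```python
import time


def bulk_with_retry(actions, send_bulk, pause=3.0, max_attempts=5, sleep=time.sleep):
    """Resubmit only the rejected actions; return whatever is still rejected."""
    pending = list(actions)
    for attempt in range(max_attempts):
        if attempt > 0:
            sleep(pause)  # back off before every resubmission
        response = send_bulk(pending)
        if not response.get("errors"):
            return []  # whole batch accepted
        # Items are in request order, so pair them positionally and keep
        # only the rejected ones (assumption: rejection == status 429).
        # Other failures are dropped here and would need separate handling,
        # for example an error file.
        pending = [
            action
            for action, item in zip(pending, response["items"])
            if next(iter(item.values())).get("status") == 429
        ]
        if not pending:
            return []
    return pending  # still rejected after max_attempts
```

Records that were accepted on an earlier attempt are never sent again, which avoids the duplicate records and 409 responses that resubmitting the whole batch would cause.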


There isn't an easy way to deal with this :P

Batch mode was created to be able to more efficiently process messages that are inserted into databases, we then found that the reduced queue congestion was a significant advantage in itself.
But unless you have a queue just for the ES action,

That's what we had to do for the fluentd case - we have a separate "ES retry queue". One of the tricky parts is that there may be multiple outputs - you may want to send each log record to Elasticsearch _and_ a message bus _and_ a remote rsyslog forwarder. But you only want to retry sending to Elasticsearch to avoid duplication in the other outputs.

doing queue manipulation isn't possible, all you can do is succeed or fail, and if you fail, the retry logic will kick in.
Rainer is going to need to comment on this.

David Lang
repeat
all that should be needed is to add tests into omelasticsearch to detect the soft errors and turn them into retries (or suspend the output as appropriate)
David Lang


