On 05/16/2018 10:08 PM, David Lang wrote:
On Wed, 16 May 2018, Rich Megginson wrote:

On 05/16/2018 05:58 PM, David Lang wrote:
there's no need to add this extra complexity (multiple rulesets and queues)

What should be happening (on any output module) is:

submit a batch.
   If rejected with a soft error, retry/suspend the output

retry of the entire batch?  see below

   if rejected with a hard error, resubmit half of the batch
if batch-size=1 and a hard error, send to errorfile

But what if 90% of the batch was successfully added?  Then you are needlessly resubmitting many of the records in the batch.

when submitting batches, you get a success/fail for the batch as a whole (for 99% of things that actually allow you to insert in batches),

For Elasticsearch - yes, there is a top level "errors" field in the response with a binary value true or false.  false means all records in the batch were successfully processed.  true means _at least one_ record in the batch was not processed successfully.  For example, in a batch of 10000 records, you will get a response of "errors": true even if 9999 of those records were successfully processed.

so you don't know what message failed.

You do know exactly which record failed and in most cases what the error was.  Here is an example from the fluent-plugin-elasticsearch unit test: https://github.com/uken/fluent-plugin-elasticsearch/blob/master/test/plugin/test_elasticsearch_error_handler.rb#L88 This is what the response looks like coming from Elasticsearch.  You get a separate response item for every record submitted in the bulk request.  In addition, you are guaranteed that the order of the items in the response is exactly the same as the order of the items submitted in the bulk request, so you can exactly correlate each request object with its response.
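To illustrate, here is a minimal sketch of extracting the failed records by pairing each submitted action with its response item by position.  The response shape follows the bulk format described above; the record _ids, statuses, and the failed_actions helper are made up for the example.

```python
# Hypothetical bulk response, shaped like the one Elasticsearch returns:
# a top-level "errors" flag plus one item per submitted action, in order.
bulk_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 429,
                   "error": {"type": "es_rejected_execution_exception"}}},
        {"index": {"_id": "3", "status": 201}},
    ],
}

# The actions we submitted, in the same order as the response items.
submitted = [{"_id": "1"}, {"_id": "2"}, {"_id": "3"}]

def failed_actions(actions, response):
    """Pair each submitted action with its response item and keep failures."""
    failed = []
    for action, item in zip(actions, response["items"]):
        result = next(iter(item.values()))  # the "index"/"create" sub-object
        if result["status"] >= 300:
            failed.append((action, result.get("error")))
    return failed

print(failed_actions(submitted, bulk_response))
```

Because the ordering guarantee holds, a plain zip() is enough; no _id bookkeeping is needed to correlate request and response.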

This is a database transaction (again, in most cases),

Not in Elasticsearch at the bulk index level.  Probably at the very low level where lucene hits the disk.

so if a batch fails, all you can do is bisect to figure out which message failed. If the endpoint is inserting some of the messages from a batch that fails, that's usually a bad thing.

now, if ES batch mode isn't an ACID transaction and it accepts some messages and then tells you which ones failed,

It does

then you can mark the ones accepted as done and just retry the ones that fail.

That's what I'm proposing.

But there's still no need for a separate ruleset and queue. In Rsyslog, if an output cannot accept a message and there's reason to think that it will in the future, then you suspend that output and try again later. If you have reason to believe that the message is never going to be able to be delivered, then you need to fail the message or you will be stuck forever. This is what the error output was made for.

So how would that work on a per-record basis?

Would this be something different than using MsgConstruct -> set fields in msg from original request -> ratelimitAddMsg for each record to resubmit?

If using the "index" (default) bulk type, this causes duplicate records to be added. If using the "create" type (and you have assigned a unique _id), you will get back many 409 Duplicate errors. This causes problems - we know because this is how the fluentd plugin used to work, which is why we had to change it.

"Bulk Rejections"
"It is much better to handle queuing in your application by gracefully handling the back pressure from a full queue. When you receive bulk rejections, you should take these steps:

    Pause the import thread for 3–5 seconds.
    Extract the rejected actions from the bulk response, since it is probable that many of the actions were successful. The bulk response will tell you which succeeded and which were rejected.
    Send a new bulk request with just the rejected actions.
    Repeat from step 1 if rejections are encountered again.

Using this procedure, your code naturally adapts to the load of your cluster and naturally backs off."
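The quoted procedure can be sketched as a small retry loop.  The send_bulk callable, the status codes, and the fake endpoint used in the demo are all assumptions for illustration, not the actual omelasticsearch or fluentd implementation:

```python
import time

def resubmit_rejected(actions, send_bulk, pause=4.0, max_rounds=10):
    """Steps from the quote: send, keep only rejected actions, pause,
    and resubmit until everything is accepted (or we give up).

    send_bulk is a hypothetical callable performing one _bulk request
    and returning the parsed JSON response.
    """
    pending = list(actions)
    for _ in range(max_rounds):
        response = send_bulk(pending)
        if not response.get("errors"):
            return []  # everything accepted
        # Keep only actions whose response item reports a rejection (429).
        pending = [
            action
            for action, item in zip(pending, response["items"])
            if next(iter(item.values()))["status"] == 429
        ]
        if not pending:
            return []
        time.sleep(pause)  # back off before resubmitting (step 1)
    return pending  # still-rejected actions after max_rounds

# Simulated endpoint: rejects the second action once, then accepts all.
calls = {"n": 0}
def fake_send_bulk(actions):
    calls["n"] += 1
    if calls["n"] == 1:
        return {"errors": True, "items": [
            {"index": {"status": 429 if i == 1 else 201}}
            for i in range(len(actions))
        ]}
    return {"errors": False,
            "items": [{"index": {"status": 201}} for _ in actions]}

left = resubmit_rejected(["a", "b", "c"], fake_send_bulk, pause=0)
print(left)
```

Note that only 429-style rejections are resubmitted here; hard per-record errors (e.g. mapping failures) would need to go to an error output instead, as discussed above.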

Does it really accept some and reject some in a random manner?  Or is it a matter of accepting the first X and rejecting any after that point? The first is easier to deal with.

It appears to be random.  So you may get a failure from the first record in the batch and the last record in the batch, and success for the others.  Or vice versa.  There appear to be many, many factors in the tuning, hardware, network, etc. that come into play.

There isn't an easy way to deal with this :P

Batch mode was created to be able to more efficiently process messages that are inserted into databases, we then found that the reduced queue congestion was a significant advantage in itself.

But unless you have a queue just for the ES action,

That's what we had to do for the fluentd case - we have a separate "ES retry queue".  One of the tricky parts is that there may be multiple outputs - you may want to send each log record to Elasticsearch _and_ a message bus _and_ a remote rsyslog forwarder. But you only want to retry sending to Elasticsearch to avoid duplication in the other outputs.

doing queue manipulation isn't possible, all you can do is succeed or fail, and if you fail, the retry logic will kick in.

Rainer is going to need to comment on this.

David Lang


all that should be needed is to add tests into omelasticsearch to detect the soft errors and turn them into retries (or suspend the output as appropriate)

David Lang

rsyslog mailing list