Re: [rsyslog] omelasticsearch - failed operation handling

Rich Megginson via rsyslog Thu, 17 May 2018 07:34:40 -0700

On 05/17/2018 05:52 AM, Brian Knox wrote:

To my knowledge, Rich is correct. This also would explain a case we hit maybe every couple of months, where rsyslog very quickly duplicates some messages it is sending to elasticsearch. I would assume this would be a case where a batch is submitted, only some of the messages are rejected, and rsyslog then duplicates messages trying to send the batch over and over again.

You can confirm this by monitoring the bulk index thread pool https://www.elastic.co/guide/en/elasticsearch/reference/2.4/cat-thread-pool.html to see if you are getting bulk rejections.

On Thu, May 17, 2018 at 12:08 AM David Lang <[email protected]> wrote:


    On Wed, 16 May 2018, Rich Megginson wrote:

    > On 05/16/2018 05:58 PM, David Lang wrote:
    >> there's no need to add this extra complexity (multiple rulesets and
    >> queues)
    >>
    >> What should be happening (on any output module) is:
    >>
    >> submit a batch.
    >>    If rejected with a soft error, retry/suspend the output
    >
    > retry of the entire batch?  see below
    >
    >> if batch-size=1 and a hard error, send to errorfile
    >>    if rejected with a hard error resubmit half of the batch
    >
    > But what if 90% of the batch was successfully added? Then you are
    > needlessly resubmitting many of the records in the batch.

    when submitting batches, you get a success/fail for the batch as a whole
    (for 99% of things that actually allow you to insert in batches), so you
    don't know what message failed. This is a database transaction (again, in
    most cases), so if a batch fails, all you can do is bisect to figure out
    what message fails. If the endpoint is inserting some of the messages from
    a batch that fails, that's usually a bad thing.
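
The bisecting approach described above can be sketched roughly as follows. This is only an illustration in Python, not rsyslog code: the `submit` callback, the `error_sink` list, and the function name are all hypothetical, and the sketch assumes an all-or-nothing endpoint that accepts or rejects each batch as a whole with a hard error.

```python
from typing import Callable, List, Sequence


def deliver_with_bisect(
    batch: Sequence[str],
    submit: Callable[[Sequence[str]], bool],
    error_sink: List[str],
) -> None:
    """Deliver a batch to an all-or-nothing endpoint, bisecting on failure.

    `submit` is a hypothetical callback: it returns True if the whole batch
    was accepted and False if the whole batch was rejected with a hard error.
    Messages that still fail at batch size 1 are appended to `error_sink`,
    which stands in for an error file.
    """
    if not batch:
        return
    if submit(batch):
        return  # whole batch accepted, nothing left to do
    if len(batch) == 1:
        # A single message that is rejected can never be delivered.
        error_sink.append(batch[0])
        return
    # Hard error on a larger batch: split it and try each half separately.
    mid = len(batch) // 2
    deliver_with_bisect(batch[:mid], submit, error_sink)
    deliver_with_bisect(batch[mid:], submit, error_sink)
```

Soft errors (where a retry or suspend is appropriate) are deliberately left out of this sketch; it only covers the hard-error path.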

    now, if ES batch mode isn't an ACID transaction and it accepts some
    messages and then tells you which ones failed, then you can mark the ones
    accepted as done and just retry the ones that fail. But there's still no
    need for a separate ruleset and queue. In Rsyslog, if an output cannot
    accept a message and there's reason to think that it will in the future,
    then you suspend that output and try again later. If you have reason to
    believe that the message is never going to be able to be delivered, then
    you need to fail the message or you will be stuck forever. This is what the
    error output was made for.

    > If using the "index" (default) bulk type, this causes duplicate records
    > to be added.
    > If using the "create" type (and you have assigned a unique _id), you will
    > get back many 409 Duplicate errors.
    > This causes problems - we know because this is how the fluentd plugin
    > used to work, which is why we had to change it.
    >
    > https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_monitoring_individual_nodes.html#_threadpool_section
    > "Bulk Rejections"
    > "It is much better to handle queuing in your application by gracefully
    > handling the back pressure from a full queue. When you receive bulk
    > rejections, you should take these steps:
    >
    >     1. Pause the import thread for 3–5 seconds.
    >     2. Extract the rejected actions from the bulk response, since it is
    >        probable that many of the actions were successful. The bulk
    >        response will tell you which succeeded and which were rejected.
    >     3. Send a new bulk request with just the rejected actions.
    >     4. Repeat from step 1 if rejections are encountered again.
    >
    > Using this procedure, your code naturally adapts to the load of your
    > cluster and naturally backs off.
    > "

    Does it really accept some and reject some in a random manner? or is it a
    matter of accepting the first X and rejecting any after that point? The
    first is easier to deal with.
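
For reference, the Elasticsearch bulk response carries one entry per action under "items", in the same order as the request, each with its own "status", so accepted and rejected actions can be interleaved. A rough Python sketch of sorting a batch by per-item result is below. The function name and the choice of 429 as the only retryable status are assumptions made for illustration; this is not how omelasticsearch is implemented.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Assumption for this sketch: only HTTP 429 (bulk queue full) is retried.
RETRYABLE_STATUS = {429}


def split_bulk_response(
    actions: Sequence[Any], response: Dict[str, Any]
) -> Tuple[List[Any], List[Any], List[Any]]:
    """Sort the actions of one bulk request by their per-item result.

    Returns (done, retry, failed):
      done   - accepted by Elasticsearch (2xx status)
      retry  - rejected with a retryable status, resend later
      failed - rejected with any other status, send to the error output
    """
    done: List[Any] = []
    retry: List[Any] = []
    failed: List[Any] = []
    for action, item in zip(actions, response.get("items", [])):
        # Each item is a single-key dict such as {"index": {...}}
        # or {"create": {...}}; the value holds the per-action result.
        result = next(iter(item.values()))
        status = result.get("status", 0)
        if 200 <= status < 300:
            done.append(action)
        elif status in RETRYABLE_STATUS:
            retry.append(action)
        else:
            failed.append(action)
    return done, retry, failed
```

Whether a 409 from the "create" type should count as failed (as it does here) or as already delivered is a policy decision that this sketch does not try to settle.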

    Batch mode was created to be able to more efficiently process messages
    that are inserted into databases, we then found that the reduced queue
    congestion was a significant advantage in itself.

    But unless you have a queue just for the ES action, doing queue
    manipulation isn't possible, all you can do is succeed or fail, and if you
    fail, the retry logic will kick in.

    Rainer is going to need to comment on this.

    David Lang

    >
    >> repeat
    >>
    >> all that should be needed is to add tests into omelasticsearch to
    >> detect the soft errors and turn them into retries (or suspend the
    >> output as appropriate)
    >>
    >> David Lang
    >
    >
    >
    _______________________________________________
    rsyslog mailing list
    http://lists.adiscon.net/mailman/listinfo/rsyslog
    http://www.rsyslog.com/professional-services/
    What's up with rsyslog? Follow https://twitter.com/rgerhards
    NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
    myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT
    POST if you DON'T LIKE THAT.


