I want to discuss solutions for the problem that I have described in
METRON-1832; Recurrent Large Indexing Error Messages. I feel this is a very
easy trap to fall into when using the default settings that currently come
with Metron.

## Problem


If any index destination like HDFS, Elasticsearch, or Solr goes down while
the Indexing topology is running, an error message is created and sent back
to the user-defined error topic.  By default, this is defined to also be
the 'indexing' topic.

The Indexing topology then consumes this error message and attempts to
write it again. If the index destination is still down, another error
occurs and another error message is created that encapsulates the original
error message.  That message is then sent to the 'indexing' topic, which is
later consumed, yet again, by the Indexing topology.

These error messages will continue to be recycled and grow larger and
larger as each new error message encapsulates all previous error messages
in the "raw_message" field.

Once the index destination recovers, one giant error message will finally
be written that contains massively duplicated, useless information which
can further negatively impact performance of the index destination.

Also, the escape character '\' continually compounds one another leading to
long strings of '\' characters in the error message.

## Background

There was some discussion on how to handle this on the original PR #453

## Solutions

(1) The first, easiest option is to just do nothing.  There was already a
discussion around this and this is the solution that we landed on in #453.

Pros: Really easy; do nothing.

Cons: Intermittent problems with ES/Solr can easily create very large error
messages that can significantly impact both search and ingest performance.

(2) Change the default indexing error topic to 'indexing_errors' to avoid
recycling error messages. Nothing will consume from the 'indexing_errors'
topic, thus preventing a cycle.

Pros: Simple, easy change that prevents the cycle.

Cons: Recoverable indexing errors are not visible to users as they will
never be indexed in ES/Solr.

(2) Add logic to limit the number times a message can be 'recycled' through
the Indexing topology.  This effectively sets a maximum number of retry
attempts.  If a message fails N times, then write the message to a separate
unrecoverable, error topic.

Pros: Recoverable errors are visible to users in ES/Solr.

Cons: More complex.  Users still need to check the unrecoverable, error
topic for potential problems.

(4) Do not further encapsulate error messages in the 'raw_message' field.
If an error message fails, don't encapsulate it in another error message.
Just push it to the error topic as-is.  Could add a field that indicates
how many times the message has failed.

Pros: Prevents giant error messages from being created from recoverable

Cons: Extended outages would still cause the Indexing topology to
repeatedly recycle these error messages, which would ultimately exhaust
resources in Storm.

What other ways can we solve this?

Reply via email to