Mingliang Liu created FLINK-35546:
-------------------------------------

             Summary: Elasticsearch 8 connector fails fast for non-retryable 
bulk request items
                 Key: FLINK-35546
                 URL: https://issues.apache.org/jira/browse/FLINK-35546
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / ElasticSearch
            Reporter: Mingliang Liu


Discussion thread: 
[https://lists.apache.org/thread/yrf0mmbch0lhk3rgkz94fr0x5qz2417l]

{quote}
Currently the Elasticsearch 8 connector retries all items if the request fails 
as a whole, and retries only the failed items if the request has partial 
failures 
[[1|https://github.com/apache/flink-connector-elasticsearch/blob/5d1f8d03e3cff197ed7fe30b79951e44808b48fe/flink-connector-elasticsearch8/src/main/java/org/apache/flink/connector/elasticsearch/sink/Elasticsearch8AsyncWriter.java#L152-L170]\].
I think this infinite retrying can be problematic when a retry can never 
succeed. For example, if an item fails with 400 (bad request) or 404 (not 
found), retrying does not help. If too many non-retryable failed items 
accumulate, new requests are processed less effectively. In extreme cases, the 
pipeline can stall because the in-flight requests are occupied by those 
failing items (see the first sketch after this quote).

FLIP-451 proposes a timeout for retrying, which helps with un-acknowledged 
requests, but it does not address the case where the request is processed and 
the failed items keep failing no matter how many times we retry. Correct me if 
I'm wrong.

One opinionated option is to fail fast for non-retryable errors like 400 / 404 
and to drop items for 409. Alternatively, we could allow users to configure 
the "drop/fail" behavior for non-retryable errors; I prefer the latter (see 
the second sketch after this quote). I checked how Logstash ingests data into 
Elasticsearch, and it takes a similar approach for non-retryable errors 
[[2|https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/main/lib/logstash/plugin_mixins/elasticsearch/common.rb#L283-L304]\].
In my day job, we have a dead-letter queue in AsyncSinkWriter for failed 
entries that exhaust their retries, but that is too specific to our setup and 
seems like overkill for the Elasticsearch connector.
{quote}
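
For illustration, here is a minimal sketch of how bulk item failures could be 
classified by HTTP status. This is not the connector's existing API; the class 
and method names are hypothetical, and the status split (429 and 5xx 
retryable, other 4xx not) follows the convention described in the quote and 
Logstash's behavior.

{code:java}
/**
 * Hypothetical helper that classifies an Elasticsearch bulk item failure by
 * its HTTP status. 429 and 5xx are transient and worth retrying; 400 (bad
 * request), 404 (not found), 409 (version conflict) and other 4xx errors are
 * deterministic, so retrying them can never change the outcome.
 */
public final class BulkFailureClassifier {

    private BulkFailureClassifier() {}

    /** Returns true if retrying the item could plausibly succeed later. */
    public static boolean isRetryable(int status) {
        if (status == 429) {
            return true; // TOO_MANY_REQUESTS: back-pressure, retry with backoff
        }
        if (status >= 500) {
            return true; // transient server-side errors
        }
        return false; // other 4xx: the request itself is at fault
    }
}
{code}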
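
And a minimal sketch of the proposed configurable "drop/fail" behavior. The 
names ({{FailureHandlingPolicy}}, {{NonRetryableItemHandler}}) are assumptions 
for illustration, not existing connector classes or options.

{code:java}
import java.util.List;

/** Hypothetical user-facing option for non-retryable bulk item failures. */
enum FailureHandlingPolicy {
    FAIL, // fail the job fast, surfacing the bad data immediately
    DROP  // drop the offending items and keep the pipeline running
}

/** Hypothetical handler that applies the configured policy after a bulk response. */
final class NonRetryableItemHandler {

    private final FailureHandlingPolicy policy;

    NonRetryableItemHandler(FailureHandlingPolicy policy) {
        this.policy = policy;
    }

    /**
     * Applies the policy to items that failed with a non-retryable status
     * (retryable items are assumed to have been re-enqueued elsewhere).
     * DROP discards the items; FAIL throws a terminal error.
     */
    <T> List<T> handle(List<T> nonRetryableItems) {
        if (nonRetryableItems.isEmpty() || policy == FailureHandlingPolicy.DROP) {
            // DROP: skip the items (ideally logging/metering them) and move on.
            return List.of();
        }
        // FAIL: raise a terminal error so the job fails fast instead of
        // spinning on retries that can never succeed.
        throw new IllegalStateException(
                nonRetryableItems.size() + " bulk items failed with non-retryable errors");
    }
}
{code}

With {{DROP}} the pipeline trades completeness for availability; with {{FAIL}} 
bad data surfaces immediately. A dead-letter queue would be a third option, 
but as noted in the quote, it seems too setup-specific for this connector.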



