[
https://issues.apache.org/jira/browse/FLINK-35546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853036#comment-17853036
]
Mingliang Liu commented on FLINK-35546:
---------------------------------------
I plan to submit a PR for discussion. I think failing fast is much better than
stalling the pipeline later with many non-retryable operations.
> Elasticsearch 8 connector fails fast for non-retryable bulk request items
> -------------------------------------------------------------------------
>
> Key: FLINK-35546
> URL: https://issues.apache.org/jira/browse/FLINK-35546
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / ElasticSearch
> Reporter: Mingliang Liu
> Priority: Major
>
> Discussion thread:
> [https://lists.apache.org/thread/yrf0mmbch0lhk3rgkz94fr0x5qz2417l]
> {quote}
> Currently the Elasticsearch 8 connector retries all items if the request
> fails as a whole, and retries failed items if the request has partial
> failures
> [1|https://github.com/apache/flink-connector-elasticsearch/blob/5d1f8d03e3cff197ed7fe30b79951e44808b48fe/flink-connector-elasticsearch8/src/main/java/org/apache/flink/connector/elasticsearch/sink/Elasticsearch8AsyncWriter.java#L152-L170].
> I think this infinite retrying can be problematic in cases where
> retrying can never eventually succeed. For example, if the request fails with
> 400 (bad request) or 404 (not found), retries do not help. If there are too
> many non-retriable failed items, new requests will be processed less
> effectively. In extreme cases, it may stall the pipeline if in-flight
> requests are occupied by those failed items.
> FLIP-451 proposes a timeout for retrying, which helps with un-acknowledged
> requests, but it does not address the case where the request gets processed
> and the failed items keep failing no matter how many times we retry. Correct
> me if I'm wrong.
> One opinionated option is to fail fast for non-retriable errors like 400 /
> 404 and to drop items for 409. Alternatively, we could allow users to
> configure the "drop/fail" behavior for non-retriable errors. I prefer the
> latter. I checked how Logstash ingests data into Elasticsearch, and it takes
> a similar approach for non-retriable errors
> [2|https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/main/lib/logstash/plugin_mixins/elasticsearch/common.rb#L283-L304].
> In my day job, we have a dead-letter queue in AsyncSinkWriter for failed
> entries that exhaust retries. I guess that is too specific to our setup and
> seems like overkill here for the Elasticsearch connector.
> {quote}
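To make the proposal above concrete, here is a minimal sketch of how per-item failures could be classified by HTTP status before deciding whether to retry, drop, or fail the pipeline. The class and method names (`FailureClassifier`, `classify`) are hypothetical for illustration, not part of the connector's actual API, and the status-to-action mapping simply follows the 400/404 fail, 409 drop split suggested in the discussion:

```java
import java.util.Set;

// Hypothetical sketch: map a bulk item's HTTP status to a handling action.
public class FailureClassifier {

    public enum Action { RETRY, FAIL, DROP }

    // Assumption from the discussion: 400/404 can never succeed on retry,
    // while 409 (version conflict) is safe to drop.
    private static final Set<Integer> FAIL_FAST = Set.of(400, 404);
    private static final Set<Integer> DROPPABLE = Set.of(409);

    public static Action classify(int httpStatus) {
        if (FAIL_FAST.contains(httpStatus)) {
            return Action.FAIL;
        }
        if (DROPPABLE.contains(httpStatus)) {
            return Action.DROP;
        }
        // Everything else (e.g. 429, 503) is treated as transient and retried.
        return Action.RETRY;
    }

    public static void main(String[] args) {
        System.out.println(classify(400)); // FAIL
        System.out.println(classify(409)); // DROP
        System.out.println(classify(503)); // RETRY
    }
}
```

A configurable "drop/fail" behavior, as preferred above, would simply make the contents of these two sets (or the action each maps to) user-supplied sink options rather than hard-coded constants.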
--
This message was sent by Atlassian Jira
(v8.20.10#820010)