[
https://issues.apache.org/jira/browse/FLINK-35546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853036#comment-17853036
]
Mingliang Liu commented on FLINK-35546:
---------------------------------------
I plan to submit a PR for discussion. I think failing fast is much better than
stalling the pipeline later with many non-retryable operations.
> Elasticsearch 8 connector fails fast for non-retryable bulk request items
> -------------------------------------------------------------------------
>
> Key: FLINK-35546
> URL: https://issues.apache.org/jira/browse/FLINK-35546
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / ElasticSearch
> Reporter: Mingliang Liu
> Priority: Major
>
> Discussion thread:
> [https://lists.apache.org/thread/yrf0mmbch0lhk3rgkz94fr0x5qz2417l]
> {quote}
> Currently the Elasticsearch 8 connector retries all items if the request
> fails as a whole, and retries failed items if the request has partial
> failures
> [1|https://github.com/apache/flink-connector-elasticsearch/blob/5d1f8d03e3cff197ed7fe30b79951e44808b48fe/flink-connector-elasticsearch8/src/main/java/org/apache/flink/connector/elasticsearch/sink/Elasticsearch8AsyncWriter.java#L152-L170].
> I think this infinite retrying can be problematic in cases where
> retrying can never eventually succeed. For example, if the request fails with
> 400 (bad request) or 404 (not found), retries do not help. If there are too
> many non-retriable failed items, new requests will be processed less
> effectively. In extreme cases, it may stall the pipeline if in-flight
> requests are occupied by those failed items.
> FLIP-451 proposes a timeout for retrying, which helps with un-acknowledged
> requests, but it does not address the case where the request gets processed
> and the failed items keep failing no matter how many times we retry. Correct
> me if I'm wrong.
> One opinionated option is to fail fast for non-retriable errors like 400 /
> 404 and to drop items for 409. Alternatively, we could allow users to
> configure the "drop/fail" behavior for non-retriable errors. I prefer the
> latter. I checked how Logstash ingests data into Elasticsearch, and it takes
> a similar approach for non-retriable errors
> [2|https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/main/lib/logstash/plugin_mixins/elasticsearch/common.rb#L283-L304].
> In my day job, we have a dead-letter queue in AsyncSinkWriter for failed
> entries that exhaust retries. I guess that is too specific to our setup and
> seems like overkill here for the Elasticsearch connector.
> {quote}
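To make the proposal above concrete, here is a minimal sketch of how per-item failures could be classified by HTTP status before deciding whether to retry, drop, or fail the pipeline. The class and method names (`FailureClassifier`, `classify`) are hypothetical for illustration, not part of the connector's actual API, and the status-to-action mapping simply follows the 400/404 fail, 409 drop split suggested in the discussion:

```java
import java.util.Set;

// Hypothetical sketch: map a bulk item's HTTP status to a handling action.
public class FailureClassifier {

    public enum Action { RETRY, FAIL, DROP }

    // Assumption from the discussion: 400/404 can never succeed on retry,
    // while 409 (version conflict) is safe to drop.
    private static final Set<Integer> FAIL_FAST = Set.of(400, 404);
    private static final Set<Integer> DROPPABLE = Set.of(409);

    public static Action classify(int httpStatus) {
        if (FAIL_FAST.contains(httpStatus)) {
            return Action.FAIL;
        }
        if (DROPPABLE.contains(httpStatus)) {
            return Action.DROP;
        }
        // Everything else (e.g. 429, 503) is treated as transient and retried.
        return Action.RETRY;
    }

    public static void main(String[] args) {
        System.out.println(classify(400)); // FAIL
        System.out.println(classify(409)); // DROP
        System.out.println(classify(503)); // RETRY
    }
}
```

A configurable "drop/fail" behavior, as preferred above, would simply make the contents of these two sets (or the action each maps to) user-supplied sink options rather than hard-coded constants.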
--
This message was sent by Atlassian Jira
(v8.20.10#820010)