[ https://issues.apache.org/jira/browse/FLINK-36319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hong Liang Teoh resolved FLINK-36319.
-------------------------------------
Resolution: Fixed
Merged commit [{{776a0d2}}|https://github.com/apache/flink-connector-prometheus/commit/776a0d24d8581f47537100cbae27adb2c0cae407] into apache:main.
> FAIL behavior on non-retriable write errors causes an infinite loop when
> restarting from checkpoint
> ---------------------------------------------------------------------------------------------------
>
> Key: FLINK-36319
> URL: https://issues.apache.org/jira/browse/FLINK-36319
> Project: Flink
> Issue Type: Sub-task
> Reporter: Lorenzo Nicora
> Assignee: Lorenzo Nicora
> Priority: Major
> Labels: pull-request-available
>
> The {{FAIL}} (default) error handling behavior, applied when a write request is
> rejected as non-retriable ({{onPrometheusNonRetriableError}}), causes the
> job to fail and restart.
> Restarting from a checkpoint causes some out-of-order (duplicate) writes, which
> Prometheus rejects as non-retriable by default.
> As a consequence, when {{onPrometheusNonRetriableError}} = {{FAIL}}, any
> restart from a checkpoint puts the job in an infinite loop.
> Changes:
> 1. default {{onPrometheusNonRetriableError}} should be
> {{DISCARD_AND_CONTINUE}}
> 2. {{onPrometheusNonRetriableError}} cannot be set to {{FAIL}}
> 3. Amend docs
> We can keep the rest of the implementation as-is for the moment and just
> prevent {{FAIL}} from being set for this behaviour, as we may later extend the
> handling of this error with a different behaviour.
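
Changes 1 and 2 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code merged in the commit: the class, enum, constant, and method names are hypothetical and are not taken from the connector's API. Only the values {{FAIL}} and {{DISCARD_AND_CONTINUE}} and the setting name {{onPrometheusNonRetriableError}} come from the issue description.

```java
// Hypothetical sketch: the type and method names here are illustrative
// and are NOT the actual flink-connector-prometheus API.
final class NonRetriableErrorConfigSketch {

    /** The two behaviours named in the issue description. */
    enum OnErrorBehavior {
        FAIL,
        DISCARD_AND_CONTINUE
    }

    /** Change 1: the default becomes DISCARD_AND_CONTINUE. */
    static final OnErrorBehavior DEFAULT_ON_NON_RETRIABLE_ERROR =
            OnErrorBehavior.DISCARD_AND_CONTINUE;

    private NonRetriableErrorConfigSketch() {
        // static helpers only
    }

    /**
     * Resolves the behaviour to use for non-retriable write errors.
     * A null argument means "not configured" and yields the default.
     * Change 2: FAIL is rejected up front, because failing the job leads
     * to a restart from checkpoint, which replays writes that Prometheus
     * rejects again as non-retriable.
     */
    static OnErrorBehavior resolveOnPrometheusNonRetriableError(OnErrorBehavior requested) {
        if (requested == null) {
            return DEFAULT_ON_NON_RETRIABLE_ERROR;
        }
        if (requested == OnErrorBehavior.FAIL) {
            throw new IllegalArgumentException(
                    "onPrometheusNonRetriableError cannot be set to FAIL: "
                            + "restarting from checkpoint would put the job in an infinite loop");
        }
        return requested;
    }
}
```

Rejecting {{FAIL}} when the configuration is resolved, rather than removing the value, matches the stated intent of keeping the rest of the implementation as-is while leaving room for a different behaviour later.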
--
This message was sent by Atlassian Jira
(v8.20.10#820010)