[ https://issues.apache.org/jira/browse/FLINK-26315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-26315:
-----------------------------------
    Labels: auto-deprioritized-critical auto-deprioritized-major failure-recovery restart stale-minor  (was: auto-deprioritized-critical auto-deprioritized-major failure-recovery restart)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Minor but is unassigned and neither itself nor its Sub-Tasks have been updated for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is still Minor, please either assign yourself or give an update. Afterwards, please remove the label or in 7 days the issue will be deprioritized.

> Stream job with multiple regions does not recover when the connection to
> ZooKeeper/HBase is lost
> -------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26315
>                 URL: https://issues.apache.org/jira/browse/FLINK-26315
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Janick Wu
>            Priority: Minor
>              Labels: auto-deprioritized-critical, auto-deprioritized-major,
> failure-recovery, restart, stale-minor
>         Attachments:
> [FLINK-26315]_improve_failure-rate_restart_strategy,_support_same_failure_cause_ignore.patch
>
>
> Our platform uses failure-rate (failure-rate-interval: 5min,
> max-failures-per-interval: 6) as the default restart-strategy, and the
> failover-strategy is region level.
> Let's assume a job with a parallelism of 10 in which all the edges in the
> stream graph are FORWARD, so the region count is equal to the job parallelism.
> If more than 5 tasks fail because the connection between the TaskManager and an
> external system such as ZooKeeper or HBase is lost, the failure rate is
> exceeded immediately, since each failed region counts as a separate failure.
> Our job will therefore never recover from such a situation (very common when
> using ZooKeeper for HA).
> h2. Possible solution:
> Improve the failure-rate strategy: record the last task failure cause and its
> timestamp. If the same task failure cause occurs multiple times in a short
> period of time, the strategy ignores the repeats.
> I have already implemented it and it works well.
> Usage:
> {quote}restart-strategy: failure-rate
> restart-strategy.failure-rate.cause.insensitive: true
> restart-strategy.failure-rate.cause.insensitive-interval: 1min
> {quote}
> This configuration ignores repeated occurrences of the same exception within 1min.
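
The idea proposed in the issue can be illustrated with a minimal, standalone sketch. This is not the attached patch and not Flink's restart-strategy API: the class name CauseInsensitiveFailureRate, its method signatures, the use of a plain String as the failure cause, and the choice to measure the insensitive interval from the last counted failure are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a failure-rate check that ignores repeats of the
// same failure cause within a short interval. Not part of Flink.
class CauseInsensitiveFailureRate {
    private final int maxFailuresPerInterval;
    private final long failureRateIntervalMs;
    private final long insensitiveIntervalMs;

    // Timestamps of the most recent counted failures, oldest first.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    // Cause and timestamp of the last failure that was actually counted.
    private String lastCause;
    private long lastCauseTimestampMs;

    CauseInsensitiveFailureRate(
            int maxFailuresPerInterval, long failureRateIntervalMs, long insensitiveIntervalMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failureRateIntervalMs = failureRateIntervalMs;
        this.insensitiveIntervalMs = insensitiveIntervalMs;
    }

    // Records a failure. Returns true if it was counted, false if it was
    // ignored as a repeat of the last counted cause within the interval.
    boolean notifyFailure(String cause, long nowMs) {
        boolean repeat = cause.equals(lastCause)
                && nowMs - lastCauseTimestampMs <= insensitiveIntervalMs;
        if (repeat) {
            return false;
        }
        lastCause = cause;
        lastCauseTimestampMs = nowMs;
        if (failureTimestamps.size() == maxFailuresPerInterval) {
            failureTimestamps.removeFirst();
        }
        failureTimestamps.addLast(nowMs);
        return true;
    }

    // A restart is allowed unless maxFailuresPerInterval counted failures
    // all fall within the last failureRateIntervalMs.
    boolean canRestart(long nowMs) {
        if (failureTimestamps.size() < maxFailuresPerInterval) {
            return true;
        }
        return nowMs - failureTimestamps.peekFirst() > failureRateIntervalMs;
    }

    public static void main(String[] args) {
        // 6 failures per 5min, repeats of the same cause ignored for 1min.
        CauseInsensitiveFailureRate strategy =
                new CauseInsensitiveFailureRate(6, 5 * 60_000L, 60_000L);
        // Ten regions fail almost simultaneously with the same cause.
        for (int region = 0; region < 10; region++) {
            strategy.notifyFailure("ZooKeeper connection lost", region);
        }
        System.out.println(strategy.canRestart(10L)); // true: only one failure was counted
    }
}
```

In this sketch the ten near-simultaneous region failures from the example in the issue are counted once, so the limit of 6 failures per 5min is not reached. Because the interval is measured from the last counted failure rather than refreshed on every repeat, an outage that lasts longer than the insensitive interval is still counted again.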