[ https://issues.apache.org/jira/browse/FLINK-26315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-26315:
-----------------------------------
    Labels: auto-deprioritized-critical auto-deprioritized-major auto-deprioritized-minor failure-recovery restart  (was: auto-deprioritized-critical auto-deprioritized-major failure-recovery restart stale-minor)
    Priority: Not a Priority  (was: Minor)

This issue was labeled "stale-minor" 7 days ago and has not received any updates so it is being deprioritized. If this ticket is actually Minor, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> Stream job which has multiple regions would not recover when the connection
> with ZooKeeper/HBase is lost.
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-26315
>                 URL: https://issues.apache.org/jira/browse/FLINK-26315
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Janick Wu
>            Priority: Not a Priority
>              Labels: auto-deprioritized-critical, auto-deprioritized-major,
> auto-deprioritized-minor, failure-recovery, restart
>         Attachments:
> [FLINK-26315]_improve_failure-rate_restart_strategy,_support_same_failure_cause_ignore.patch
>
>
> Our platform uses failure-rate (failure-rate-interval: 5min,
> max-failures-per-interval: 6) as the default restart-strategy, and the
> failover-strategy is region level.
> Let's assume a job with a parallelism of 10 in which all edges of the stream
> graph are FORWARD, so the region count equals the job parallelism. If more
> than 5 tasks fail because the connection between the TaskManager and an
> external system such as ZooKeeper or HBase is lost, the failure rate is
> exceeded immediately. Our job therefore never recovers from such a situation
> (very common when using ZooKeeper for HA).
> h2. Possible solution:
> Improve the failure-rate strategy: record the last task failure cause and its
> timestamp. If the same failure cause occurs multiple times within a short
> period of time, the strategy ignores the repeats.
> I have already implemented this and it works well.
> Usage:
> {quote}restart-strategy: failure-rate
> restart-strategy.failure-rate.cause.insensitive: true
> restart-strategy.failure-rate.cause.insensitive-interval: 1min
> {quote}
> This configuration ignores a continuously repeating exception within 1 min.
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
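
The idea proposed in the ticket can be illustrated with a minimal Java sketch. This is not the attached patch and not Flink's own restart strategy class. The class name, the method name, and the use of a plain string as the "failure cause" are all hypothetical. The sketch also makes two assumptions. First, restarts stop once the number of counted failures inside the failure-rate interval reaches max-failures-per-interval, which matches the ticket's "more than 5" with a limit of 6. Second, a repeat is any failure with the same cause that arrives within the insensitive interval of the last counted occurrence of that cause.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a failure-rate check that ignores repeated causes. */
final class CauseInsensitiveFailureRate {
    private final int maxFailuresPerInterval;
    private final long failureRateIntervalMs;
    private final long insensitiveIntervalMs; // 0 disables de-duplication

    // Timestamps of the failures that were actually counted, oldest first.
    private final Deque<Long> countedFailures = new ArrayDeque<>();
    // Last time each cause was counted (never pruned in this sketch).
    private final Map<String, Long> lastCounted = new HashMap<>();

    CauseInsensitiveFailureRate(int maxFailuresPerInterval,
                                Duration failureRateInterval,
                                Duration insensitiveInterval) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failureRateIntervalMs = failureRateInterval.toMillis();
        this.insensitiveIntervalMs = insensitiveInterval.toMillis();
    }

    /**
     * Records one task failure and reports whether a restart is still allowed.
     * The caller supplies the clock value so the behavior is deterministic.
     */
    boolean notifyFailure(String cause, long nowMs) {
        Long last = lastCounted.get(cause);
        boolean repeat = insensitiveIntervalMs > 0
                && last != null
                && nowMs - last < insensitiveIntervalMs;
        if (!repeat) {
            // First occurrence of this cause, or the previous one is old enough.
            lastCounted.put(cause, nowMs);
            countedFailures.addLast(nowMs);
        }
        // Drop counted failures that have left the failure-rate window.
        while (!countedFailures.isEmpty()
                && nowMs - countedFailures.peekFirst() > failureRateIntervalMs) {
            countedFailures.removeFirst();
        }
        // Assumption: the limit is hit when the count reaches the maximum.
        return countedFailures.size() < maxFailuresPerInterval;
    }
}
```

With the ticket's settings (6 failures per 5 min), six region failures caused by the same lost connection exhaust the budget when the insensitive interval is zero. With an insensitive interval of 1 min, the same burst is counted once and the job may keep restarting.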