Janick Wu created FLINK-26315:
---------------------------------
Summary: Streaming job with multiple regions does not recover
when the connection to ZooKeeper/HBase is lost.
Key: FLINK-26315
URL: https://issues.apache.org/jira/browse/FLINK-26315
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination, Runtime / Task
Affects Versions: 1.12.0
Reporter: Janick Wu
Our platform uses failure-rate (failure-rate-interval:
5 min, max-failures-per-interval: 6) as the default restart-strategy, and the
failover-strategy is region.
Assume a job with a parallelism of 10 in which all edges in the stream graph
are FORWARD, so the region count equals the job parallelism. If more than 5
tasks fail because the connection between the TaskManagers and an external
system such as ZooKeeper or HBase is lost, the failure rate is exceeded
immediately, since each failing region counts toward the limit. The job then
never recovers from such a situation.
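The arithmetic above can be illustrated with a small model. This is only a sketch and not Flink's actual restart-strategy class: FailureRateModel and firstRejectedFailure are hypothetical names, and the model assumes that a restart is refused once the number of failures inside the interval reaches max-failures-per-interval, which matches the "more than 5" threshold described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified, hypothetical model of a failure-rate restart strategy. */
class FailureRateModel {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateModel(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records one failure and reports whether a restart is still allowed. */
    boolean recordFailureAndCanRestart(long nowMillis) {
        // Drop failures that have fallen out of the sliding interval.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.pollFirst();
        }
        failureTimestamps.addLast(nowMillis);
        // Assumption: the budget is exhausted once the count reaches the maximum.
        return failureTimestamps.size() < maxFailuresPerInterval;
    }

    /**
     * Simulates one outage that fails every region almost simultaneously.
     * Returns the 1-based index of the failure that exhausts the budget,
     * or -1 if every region is allowed to restart.
     */
    static int firstRejectedFailure(int regions, int maxFailures, long intervalMillis) {
        FailureRateModel model = new FailureRateModel(maxFailures, intervalMillis);
        for (int i = 1; i <= regions; i++) {
            // Assumption: region failures arrive one millisecond apart.
            if (!model.recordFailureAndCanRestart(i)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 10 regions, max-failures-per-interval 6, failure-rate-interval 5 min.
        System.out.println(firstRejectedFailure(10, 6, 5 * 60 * 1000L));
    }
}
```

Under these assumptions, a single outage that hits 10 single-task regions uses up the whole budget at the sixth region failure, while a job with 5 or fewer regions would still be restarted.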
h2. Possible solution:
Improve the failure-rate strategy: record the cause and timestamp of the last
task failure. If the same failure cause occurs multiple times within a short
period of time, count it once and ignore the rest.
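One way the proposal could look is sketched below. This is not an existing Flink API: DedupFailureRate, dedupWindowMillis, and the choice to identify a cause by its exception class name are all assumptions made for illustration. A real implementation would need a more careful definition of "same cause".

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical failure-rate strategy that counts a burst of identical causes once. */
class DedupFailureRate {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final long dedupWindowMillis; // hypothetical "short period of time"
    private final Deque<Long> countedFailures = new ArrayDeque<>();
    private String lastCauseKey;   // cause of the last counted failure
    private long lastCauseMillis;  // timestamp of the last counted failure

    DedupFailureRate(int maxFailuresPerInterval, long intervalMillis, long dedupWindowMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
        this.dedupWindowMillis = dedupWindowMillis;
    }

    /** Records a failure and reports whether a restart is still allowed. */
    boolean notifyFailure(Throwable cause, long nowMillis) {
        // Assumption: the exception class name is a good enough cause identifier.
        String causeKey = cause.getClass().getName();
        boolean duplicate = causeKey.equals(lastCauseKey)
                && nowMillis - lastCauseMillis <= dedupWindowMillis;
        if (!duplicate) {
            // Only counted failures move the timestamp, so a persistent problem
            // is still counted once per dedup window.
            lastCauseKey = causeKey;
            lastCauseMillis = nowMillis;
            countedFailures.addLast(nowMillis);
        }
        // Drop counted failures that have fallen out of the sliding interval.
        while (!countedFailures.isEmpty()
                && nowMillis - countedFailures.peekFirst() > intervalMillis) {
            countedFailures.pollFirst();
        }
        return countedFailures.size() < maxFailuresPerInterval;
    }

    /** Number of failures currently counted against the budget. */
    int countedFailures() {
        return countedFailures.size();
    }

    public static void main(String[] args) {
        DedupFailureRate strategy = new DedupFailureRate(6, 5 * 60 * 1000L, 10_000L);
        boolean canRestart = true;
        for (int region = 1; region <= 10; region++) {
            // Ten regions fail with the same cause within a few milliseconds.
            canRestart = strategy.notifyFailure(new IOException("connection lost"), region);
        }
        System.out.println(canRestart + " " + strategy.countedFailures());
    }
}
```

In this sketch the timestamp is not refreshed by ignored duplicates. That is a deliberate choice: otherwise a continuous stream of identical failures would never be counted and the job could restart indefinitely.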
--
This message was sent by Atlassian Jira
(v8.20.1#820001)