[ 
https://issues.apache.org/jira/browse/FLINK-26315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Janick Wu updated FLINK-26315:
------------------------------
    Description: 
  Our platform uses failure-rate (failure-rate-interval: 
5min, max-failures-per-interval: 6) as the default restart-strategy, and the 
failover-strategy is region level.
  Let's assume a job with a parallelism of 10 in which all the edges in the 
stream graph are FORWARD, so the region count equals the job parallelism. If 
more than 5 tasks fail because the connection between the TaskManager and an 
external system such as ZooKeeper or HBase is lost, the failure rate is 
exceeded immediately. So our job will never recover from such a situation 
(very common when ZooKeeper is used for HA).
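
To make the arithmetic concrete, here is a minimal, self-contained sketch of 
a failure-rate counter. It is a simplified model, not Flink's actual 
implementation: the class and method names are hypothetical, and it assumes 
the strategy gives up once the number of failures inside the interval reaches 
max-failures-per-interval (which matches the "more than 5" above).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of a failure-rate restart strategy (hypothetical, not Flink code). */
class FailureRateSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    // Timestamps of the failures that fall inside the current interval.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records one failure and returns true if a restart is still allowed. */
    boolean recordFailureAndCanRestart(long nowMillis) {
        // Forget failures that are older than the interval.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.removeFirst();
        }
        failureTimestamps.addLast(nowMillis);
        // Assumption: reaching the limit already counts as exceeding the rate.
        return failureTimestamps.size() < maxFailuresPerInterval;
    }

    /**
     * Simulates every region failing within a few milliseconds of each other.
     * Returns the 1-based index of the first failure that is denied a restart,
     * or -1 if all failures can be restarted.
     */
    static int firstDeniedFailure(int regions, int maxFailures, long intervalMillis) {
        FailureRateSketch strategy = new FailureRateSketch(maxFailures, intervalMillis);
        for (int i = 1; i <= regions; i++) {
            if (!strategy.recordFailureAndCanRestart(1_000L + i)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 10 FORWARD-only regions, limit 6 per 5 minutes: prints 6.
        System.out.println(firstDeniedFailure(10, 6, 5 * 60_000L));
    }
}
```

Under these assumptions a single connection loss that hits all 10 regions 
uses up the whole budget at once, although it is one incident.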
h2. Possible solution:

Improve the failure-rate strategy: record the last task failure cause and its 
timestamp. If the same failure cause occurs multiple times within a short 
period of time, the strategy ignores the repeated occurrences.
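
A rough sketch of how this idea could look, again as a standalone model and 
not as a patch against Flink. The class name, the dedupWindowMillis parameter 
and the use of a plain string as the failure cause are all assumptions made 
for illustration; the proposal does not specify them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Failure-rate counter that ignores quick repeats of the last cause (hypothetical sketch). */
class DedupingFailureRateSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final long dedupWindowMillis; // hypothetical "short period of time"
    private final Deque<Long> countedFailures = new ArrayDeque<>();
    private String lastCause;      // last failure cause that was counted
    private long lastCauseMillis;  // when that cause was counted

    DedupingFailureRateSketch(int maxFailuresPerInterval, long intervalMillis,
                              long dedupWindowMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
        this.dedupWindowMillis = dedupWindowMillis;
    }

    /** Records one failure and returns true if a restart is still allowed. */
    boolean recordFailureAndCanRestart(String cause, long nowMillis) {
        // Forget counted failures that are older than the interval.
        while (!countedFailures.isEmpty()
                && nowMillis - countedFailures.peekFirst() > intervalMillis) {
            countedFailures.removeFirst();
        }
        boolean repeated = cause.equals(lastCause)
                && nowMillis - lastCauseMillis <= dedupWindowMillis;
        if (!repeated) {
            // Only a new cause, or the same cause after the window, is counted.
            lastCause = cause;
            lastCauseMillis = nowMillis;
            countedFailures.addLast(nowMillis);
        }
        return countedFailures.size() < maxFailuresPerInterval;
    }

    int countedFailures() {
        return countedFailures.size();
    }

    public static void main(String[] args) {
        DedupingFailureRateSketch strategy =
                new DedupingFailureRateSketch(6, 5 * 60_000L, 10_000L);
        for (int region = 1; region <= 10; region++) {
            strategy.recordFailureAndCanRestart("zk-connection-lost", 1_000L + region);
        }
        System.out.println("counted failures: " + strategy.countedFailures());
    }
}
```

In this sketch the timestamp is not refreshed by ignored repeats, so a fault 
that persists longer than the window is still counted once per window. 
Because only the last cause is remembered, two alternating causes would not 
be deduplicated.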

  was:
  Our platform uses failure-rate (failure-rate-interval: 
5min, max-failures-per-interval: 6) as the default restart-strategy, and the 
failover-strategy is region level.
  Let's assume a job with a parallelism of 10 in which all the edges in the 
stream graph are FORWARD, so the region count equals the job parallelism. If 
more than 5 tasks fail because the connection between the TaskManager and an 
external system such as ZooKeeper or HBase is lost, the failure rate is 
exceeded immediately. So our job will never recover from such a situation.
h2. Possible solution:

Improve the failure-rate strategy: record the last task failure cause and its 
timestamp. If the same failure cause occurs multiple times within a short 
period of time, the strategy ignores the repeated occurrences.


> Stream job with multiple regions does not recover when the connection to 
> ZooKeeper/HBase is lost.
> -------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26315
>                 URL: https://issues.apache.org/jira/browse/FLINK-26315
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.12.0
>            Reporter: Janick Wu
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
