[ 
https://issues.apache.org/jira/browse/FLINK-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu updated FLINK-13055:
----------------------------
    Description: 
In the current region failover process, the states of most input result 
partitions are unknown. Even when the failure cause is a PartitionException, 
only one unhealthy partition can be identified.

This may lead to multiple unsuccessful failovers before all the unhealthy but 
needed partitions are identified and their producers are included in the 
failover as well. (An unsuccessful failover here means that the recovered tasks 
fail again soon afterwards because some input partitions are missing.)

Using the JM-side tracked partition states to identify unhealthy (missing) 
partitions earlier in the region failover can help with this case.

The basic idea is to build RestartPipelinedRegionStrategy with a 
ResultPartitionAvailabilityChecker which can query the JM side tracked 
partition states.
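
The idea above can be sketched roughly as follows. This is only an 
illustration and not the actual Flink implementation: the Region class, the 
String partition IDs, the regionsToRestart method and the isAvailable 
signature are all assumptions made for this example. The sketch also only 
walks upstream to producers of missing partitions and leaves out any handling 
of downstream consumer regions.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class RegionFailoverSketch {

    /** Hypothetical stand-in for the checker named in the issue. */
    public interface ResultPartitionAvailabilityChecker {
        boolean isAvailable(String partitionId);
    }

    /** Hypothetical region: a name plus the partitions it consumes. */
    public static final class Region {
        public final String name;
        // consumed partition ID -> region that produces it
        public final Map<String, Region> inputs = new LinkedHashMap<>();

        public Region(String name) {
            this.name = name;
        }
    }

    /**
     * Collects the failed region and, transitively, every producer region
     * whose needed output partition is reported as unavailable.
     */
    public static Set<String> regionsToRestart(
            Region failed, ResultPartitionAvailabilityChecker checker) {
        Set<String> toRestart = new LinkedHashSet<>();
        Deque<Region> pending = new ArrayDeque<>();
        pending.add(failed);
        while (!pending.isEmpty()) {
            Region region = pending.poll();
            if (!toRestart.add(region.name)) {
                continue; // already scheduled for restart
            }
            for (Map.Entry<String, Region> input : region.inputs.entrySet()) {
                if (!checker.isAvailable(input.getKey())) {
                    // Missing input: its producer must join the failover.
                    pending.add(input.getValue());
                }
            }
        }
        return toRestart;
    }

    public static void main(String[] args) {
        // Chain A -> B -> C; only partition "pA" is still tracked as available.
        Region a = new Region("A");
        Region b = new Region("B");
        Region c = new Region("C");
        b.inputs.put("pA", a);
        c.inputs.put("pB", b);

        Set<String> available = Collections.singleton("pA");
        System.out.println(regionsToRestart(c, available::contains)); // prints [C, B]
    }
}
```

With a checker like this, the producer of a missing partition is pulled into 
the first failover instead of being discovered only after the restarted 
consumer fails again.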

  was:
In the current region failover process, the states of most input result 
partitions are unknown. Even when the failure cause is a PartitionException, 
only one unhealthy partition can be identified.

This may lead to multiple unsuccessful failovers before all the unhealthy but 
needed partitions are identified and their producers are included in the 
failover as well.

Using the JM-side tracked partition states to identify unhealthy (missing) 
partitions earlier in the region failover can help with this case.

The basic idea is to build RestartPipelinedRegionStrategy with a 
ResultPartitionAvailabilityChecker which can query the JM side tracked 
partition states.


> Leverage JM side partition state to improve region failover experience
> ----------------------------------------------------------------------
>
>                 Key: FLINK-13055
>                 URL: https://issues.apache.org/jira/browse/FLINK-13055
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Major
>
> In the current region failover process, the states of most input result 
> partitions are unknown. Even when the failure cause is a PartitionException, 
> only one unhealthy partition can be identified.
> This may lead to multiple unsuccessful failovers before all the unhealthy but 
> needed partitions are identified and their producers are included in the 
> failover as well. (An unsuccessful failover here means that the recovered 
> tasks fail again soon afterwards because some input partitions are missing.)
> Using the JM-side tracked partition states to identify unhealthy (missing) 
> partitions earlier in the region failover can help with this case.
> The basic idea is to build RestartPipelinedRegionStrategy with a 
> ResultPartitionAvailabilityChecker which can query the JM side tracked 
> partition states.



