[ https://issues.apache.org/jira/browse/FLINK-14439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gary Yao reassigned FLINK-14439:
--------------------------------

    Assignee: Gary Yao

> RestartPipelinedRegionStrategy leverage tracked partition availability for
> better failover experience in DefaultScheduler
> --------------------------------------------------------------------------
>
>                 Key: FLINK-14439
>                 URL: https://issues.apache.org/jira/browse/FLINK-14439
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Gary Yao
>            Priority: Major
>             Fix For: 1.10.0
>
>
> In the current region failover with DefaultScheduler, the states of most
> input result partitions are unknown. Even when the failure cause is a
> PartitionException, only one unhealthy partition can be identified.
> This may lead to multiple unsuccessful failovers before all the unhealthy
> but needed partitions are identified and their producers are included in
> the failover as well. (An unsuccessful failover here means that the
> recovered tasks fail again soon because some input partitions are missing.)
> Using the partition states tracked on the JM side to help region failover
> identify unhealthy (missing) partitions earlier can help in this case.
> To achieve this, I propose the following:
> 1. Change {{FailoverStrategy.Factory#create(FailoverTopology)}} to
> {{FailoverStrategy.Factory#create(FailoverTopology,
> ResultPartitionAvailabilityChecker)}}.
> 2. Add {{SchedulerBase#getResultPartitionAvailabilityChecker}}, which
> returns {{getExecutionGraph().getResultPartitionAvailabilityChecker()}}.
> 3. In DefaultScheduler, use the ResultPartitionAvailabilityChecker from
> SchedulerBase to create the failover strategy from the factory.
> The current behavior also fails BatchFineGrainedRecoveryITCase due to
> unexpected failover counts. This is because the legacy scheduler already
> has a similar optimization from FLINK-13055.
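
For illustration only, the idea behind the proposal could be sketched roughly as follows. This is not the actual Flink code: the interface below is a simplified stand-in for ResultPartitionAvailabilityChecker that uses plain strings as partition IDs, and RegionFailoverSketch, its two maps and getRegionsToRestart are hypothetical names introduced for this example. The sketch only shows how asking the checker up front lets a single failover pull in the producer regions of every missing partition. It deliberately leaves out everything else a real strategy would do, such as restarting downstream consumer regions.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified stand-in for the checker named in the proposal (string IDs are an assumption). */
interface ResultPartitionAvailabilityChecker {
    boolean isAvailable(String partitionId);
}

/** Hypothetical, heavily reduced model of region failover; not the Flink implementation. */
final class RegionFailoverSketch {
    // Region name -> IDs of the result partitions that region consumes.
    private final Map<String, List<String>> consumedPartitionsByRegion;
    // Partition ID -> name of the region that produces it.
    private final Map<String, String> producerRegionByPartition;
    private final ResultPartitionAvailabilityChecker checker;

    RegionFailoverSketch(
            Map<String, List<String>> consumedPartitionsByRegion,
            Map<String, String> producerRegionByPartition,
            ResultPartitionAvailabilityChecker checker) {
        this.consumedPartitionsByRegion = consumedPartitionsByRegion;
        this.producerRegionByPartition = producerRegionByPartition;
        this.checker = checker;
    }

    /**
     * Returns the failed region plus, transitively, the producer regions of every
     * consumed partition that the checker reports as unavailable.
     */
    Set<String> getRegionsToRestart(String failedRegion) {
        Set<String> toRestart = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        toRestart.add(failedRegion);
        pending.add(failedRegion);

        while (!pending.isEmpty()) {
            String region = pending.poll();
            List<String> inputs =
                    consumedPartitionsByRegion.getOrDefault(region, Collections.emptyList());
            for (String partition : inputs) {
                if (checker.isAvailable(partition)) {
                    continue; // healthy input, its producer does not need to restart
                }
                String producer = producerRegionByPartition.get(partition);
                // Set#add returns false for regions that are already scheduled for restart.
                if (producer != null && toRestart.add(producer)) {
                    pending.add(producer);
                }
            }
        }
        return toRestart;
    }
}
```

In this sketch, a region whose inputs are all available restarts alone. A region with several missing inputs restarts together with all affected producers in one pass, instead of finding them one failed attempt at a time.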