zhuzhurk opened a new pull request #8430: [FLINK-12068] [runtime] Backtrack 
failover regions if intermediate results are unavailable
URL: https://github.com/apache/flink/pull/8430
 
 
   ## What is the purpose of the change
   
   *In region failover, when a region fails because its input result 
partitions are unavailable, the failover strategy needs to backtrack through 
the failover regions to recover the failed tasks as well as the unavailable 
result partitions. The detailed design is at 
https://docs.google.com/document/d/1YHOpMLdC-dtgjcM-EDn6v-oXgsEQKXSoMjqRcYVbJA8/edit.*
   
   
   ## Brief change log
   
     - *RestartPipelinedRegionStrategy (based on the next-generation interface) 
handles DataConsumptionException by proposing to restart the regions that 
produce needed but unavailable result partitions, as well as all of their 
consumer regions.*
     - *Add a ResultPartitionAvailabilityChecker interface and implement it for 
querying result partition availability*
     - *Calculate and cache region inputs and consumers in the region building 
phase. This speeds up failover handling significantly, at the time cost of a 
significantly slower region building phase and the space cost of 2 edge-scale 
caches. See the verification section below.*
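
The backtracking described in the change log can be illustrated with a minimal sketch. It is not the code in this PR: `RegionBacktrackSketch`, `Region`, `AvailabilityChecker`, `consume` and `regionsToRestart` are hypothetical names chosen for illustration, and partitions are identified by plain strings. The sketch assumes each region caches its inputs (consumed partition to producer region) and its consumer regions, and that the checker already reports the partition that caused the failure as unavailable.

```java
import java.util.*;

// Hypothetical sketch; these are not Flink's actual classes.
final class RegionBacktrackSketch {

    /** Answers whether a produced result partition can still be consumed. */
    interface AvailabilityChecker {
        boolean isAvailable(String partitionId);
    }

    /** A failover region with its cached inputs and consumers. */
    static final class Region {
        final String name;
        // Cache 1: consumed partition id -> region that produces it.
        final Map<String, Region> inputs = new LinkedHashMap<>();
        // Cache 2: regions that consume this region's results.
        final List<Region> consumers = new ArrayList<>();

        Region(String name) {
            this.name = name;
        }

        /** Records that this region consumes partitionId produced by producer. */
        void consume(String partitionId, Region producer) {
            inputs.put(partitionId, producer);
            producer.consumers.add(this);
        }
    }

    /** Returns the names of all regions to restart when the given region fails. */
    static Set<String> regionsToRestart(Region failed, AvailabilityChecker checker) {
        Set<Region> visited = new LinkedHashSet<>();
        Deque<Region> queue = new ArrayDeque<>();
        visited.add(failed);
        queue.add(failed);

        while (!queue.isEmpty()) {
            Region region = queue.poll();

            // Backtrack: a region to restart needs its inputs, so the producers
            // of unavailable input partitions must be restarted as well.
            for (Map.Entry<String, Region> input : region.inputs.entrySet()) {
                Region producer = input.getValue();
                if (!checker.isAvailable(input.getKey()) && visited.add(producer)) {
                    queue.add(producer);
                }
            }

            // Forward: all consumer regions of a restarted region restart too.
            for (Region consumer : region.consumers) {
                if (visited.add(consumer)) {
                    queue.add(consumer);
                }
            }
        }

        Set<String> names = new TreeSet<>();
        for (Region region : visited) {
            names.add(region.name);
        }
        return names;
    }
}
```

With regions A -> B -> C, a failure of B while A's output is unavailable yields {A, B, C}; if A's output is still available, only {B, C} is returned. Looking up inputs and consumers from the two caches keeps each failover computation proportional to the regions visited, which is the trade-off the last bullet describes.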
   
   
   ## Verifying this change
   
   
   This change added tests and can be verified as follows:
     - *Added tests in flip1 RestartPipelinedRegionStrategyTest to verify the 
correctness*
     - *Performance tests were conducted manually. Currently, for a job with 16 
million edges, it takes < 200ms to calculate the tasks to restart. The region 
building time in this case increased from 600ms to ~6s because of building the 
helper caches.*
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (no)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (no)
     - The serializers: (no)
     - The runtime per-record code paths (performance sensitive): (no)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
     - The S3 file system connector: (no)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (no)
     - If yes, how is the feature documented? (not documented)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
