[GitHub] [flink] StephanEwen commented on pull request #16655: [FLINK-23512][runtime][checkpoint] Check for illegal modifications of JobGraph with partially finished operators

GitBox Mon, 09 Aug 2021 03:50:48 -0700


StephanEwen commented on pull request #16655:
URL: https://github.com/apache/flink/pull/16655#issuecomment-895125495



   Thanks for the quick reply!
   
   For most of the questions, looks like we have consensus on how to proceed:
     - The OperatorID issue is in fact not an issue, just a naming confusion. 
I see that this issue was created to reduce the confusion: 
https://issues.apache.org/jira/browse/FLINK-23681
     - For the Checkpoint Metadata, we encode the status in the current format, 
but factor out that logic into dedicated methods for clarity.
     - We move the logic for adding the finished state to the `CheckpointPlan`. 
In the future, there is also a possible optimization to cache the set of 
finished states and only re-build it when the `CheckpointPlan` changes.
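
   A minimal sketch of the caching idea from the last point. All names here 
(`FinishedStateCacheSketch`, `Plan`, `finishedStatesFor`) are hypothetical 
stand-ins and not the actual `CheckpointPlan` API; the sketch also assumes 
that a changed plan shows up as a new plan instance:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: cache the finished-state set per plan instance. */
final class FinishedStateCacheSketch {

    /** Hypothetical stand-in for a checkpoint plan; it only lists finished task IDs. */
    static final class Plan {
        final List<String> finishedTaskIds;

        Plan(List<String> finishedTaskIds) {
            this.finishedTaskIds = finishedTaskIds;
        }
    }

    private Plan cachedFor;
    private Set<String> cachedFinished = Collections.emptySet();

    /** Counts rebuilds, only to make the caching behavior observable. */
    int rebuilds;

    /** Returns the finished-state set, rebuilding it only when the plan instance changed. */
    Set<String> finishedStatesFor(Plan plan) {
        if (plan != cachedFor) {
            cachedFinished = Collections.unmodifiableSet(new HashSet<>(plan.finishedTaskIds));
            cachedFor = plan;
            rebuilds++;
        }
        return cachedFinished;
    }
}
```

   Comparing plan instances by identity keeps the check cheap; it would not 
work if a plan were mutated in place.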
   
   For the rules on when to allow the changes, I understand the issue you 
raised with not having the other JobGraph to compare to. I think we should 
avoid an approach that needs access to the previous job graph.
     - Can we capture this in rules similar to the ones you have now, but 
simpler?
     - Or can we solve this by passing another flag into the 
`restoreCheckpoint()` method, similar to the `allowUnusedState` flag? For 
example, a flag like `allowPartiallyFinishedTasks`, which is set to `false` 
only if a new job is submitted.
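
   A rough sketch of how the flag variant could look. The class, the 
`OperatorStatus` type, and the signature below are hypothetical and do not 
reflect the real `restoreCheckpoint()` method; the sketch also assumes that 
"partially finished" means some, but not all, subtasks of an operator have 
finished:

```java
import java.util.Map;

/** Hypothetical sketch of a restore check guarded by a flag. */
final class RestoreFlagSketch {

    /** Hypothetical per-operator summary taken from a checkpoint. */
    static final class OperatorStatus {
        final int finishedSubtasks;
        final int parallelism;

        OperatorStatus(int finishedSubtasks, int parallelism) {
            this.finishedSubtasks = finishedSubtasks;
            this.parallelism = parallelism;
        }

        /** True if some, but not all, subtasks have finished. */
        boolean isPartiallyFinished() {
            return finishedSubtasks > 0 && finishedSubtasks < parallelism;
        }
    }

    /**
     * Rejects the restore if the flag is false and any operator is partially finished.
     * The caller would pass false only when a new job is submitted.
     */
    static void restoreCheckpoint(
            Map<String, OperatorStatus> operators, boolean allowPartiallyFinishedTasks) {
        if (allowPartiallyFinishedTasks) {
            return; // Same job recovering: nothing to verify here.
        }
        for (Map.Entry<String, OperatorStatus> entry : operators.entrySet()) {
            if (entry.getValue().isPartiallyFinished()) {
                throw new IllegalStateException(
                        "Operator " + entry.getKey()
                                + " is partially finished; restore into a new job is not allowed.");
            }
        }
    }
}
```

   The point of this shape is that the check only looks at the checkpoint 
itself, so it never needs the previous job graph.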
   I would need to think a bit more here, so maybe for the time being, we could 
either
     - keep the rules you have for now, but already document them in the 
suggested way (no partially finished tasks). As I understand it, your rules are 
a special case of that.
     - not have any complex rules there at all: we allow all modifications, but 
caution users that upgrades with finished tasks are currently not well-defined. 
That could be an option if we see that the evaluation of the current rules is 
either fragile or has too high an overhead for high-parallelism jobs.
   
   That would also be my final suggestion here: We need to test that the rules 
for whether a restore is allowed scale well to Execution Graphs with a 
parallelism > 10000 and around 100 operators. They should not take longer than 
a few milliseconds to execute.
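
   As a starting point for such a test, here is a small timing sketch over 
synthetic data. Everything in it is an assumption: it models finished subtasks 
as one `BitSet` per operator and uses a simple "no partially finished 
operator" rule instead of the actual rules and the real Execution Graph. A 
proper measurement would run the real check, ideally with a benchmark harness 
such as JMH:

```java
import java.util.BitSet;

/** Hypothetical timing sketch for a restore rule on a large synthetic job. */
final class RestoreRuleScaleSketch {

    /** Builds one finished-subtask bitmap per operator; one operator may be half finished. */
    static BitSet[] syntheticFinishedFlags(int operators, int parallelism, int partialOperatorIndex) {
        BitSet[] finished = new BitSet[operators];
        for (int op = 0; op < operators; op++) {
            finished[op] = new BitSet(parallelism);
            if (op == partialOperatorIndex) {
                finished[op].set(0, parallelism / 2); // first half of the subtasks finished
            }
        }
        return finished;
    }

    /** True if any operator has some, but not all, subtasks finished. */
    static boolean hasPartiallyFinishedOperator(BitSet[] finished, int parallelism) {
        for (BitSet flags : finished) {
            int count = flags.cardinality();
            if (count > 0 && count < parallelism) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int operators = 100;
        int parallelism = 10_000;
        // Put the partial operator last so the check has to scan every operator.
        BitSet[] finished = syntheticFinishedFlags(operators, parallelism, operators - 1);

        boolean result = false;
        for (int i = 0; i < 1_000; i++) { // crude warm-up before timing
            result |= hasPartiallyFinishedOperator(finished, parallelism);
        }

        long start = System.nanoTime();
        result = hasPartiallyFinishedOperator(finished, parallelism);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("partially finished: " + result + ", check took " + micros + " us");
    }
}
```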


