gaoyunhaii commented on pull request #16655:
URL: https://github.com/apache/flink/pull/16655#issuecomment-895257415


   Hi @StephanEwen, very thanks for the suggestions! I'm currently fixing the 
code, and for the remaining issues:
   
   > For the rules of when to allow the changes, I understand the issue you 
raised with not having the other JobGraph to > compare to. I think we should 
avoid a way that needs access to the previous job graph.
   > 
   > - Can we capture this in rules similar to what you have now, but simpler 
rules?
   > - Or can we solve this by passing another flag into the 
restoreCheckpoint() method, similar to the allowUnusedState flag. A flag like 
allowPartiallyFinishedTasks, which is only set to false if a new job is 
submitted?
   
   - I also agree now we are checking the specific cases for the more general 
rules, but it currently seems to me we could not further simple the rule since 
we might have to consider the cartesian product of `[partly finished, fully 
finished]` * `[pointwise, all_to_all]`.  
   - I think we indeed could add a flag to only do the check on the initial 
restore (`restoreInitialCheckpointIfPresent` and `restoreSavepoint`. We could 
always benefit from the flag for failover if we actually has the check. But on 
the other side, we might still need to consider the same issue for the initial 
restoring ?
   
   
   > I would need to think a bit more here, so maybe for the time being, we 
could either
   > 
   > - keep the rules you have for now, but try to document it already in the 
suggested way (no partially finished tasks). As I understand it, your rules are 
a more special case of that.
   > - not have any complex rules there at all: we allow all modifications, but 
caution users that upgrades with finished tasks are currently not well-defined. 
That could be an option if we see that the evaluation of the current rules is 
either fragile or has a too high overhead with high parallelism jobs.
   > 
   > That would be my final suggestion here also: We need to test that the 
rules whether restore is allowed scale well to Execution Graphs with > 10000 
parallelism and around 100 operators. They should not take longer than few 
milliseconds to execute.
   
   I agree with that we could only promise to the users that we do not support 
topology change with finished tasks. I'll first run the tests regarding the 
performance so that we could make the decision based on the tests results~ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to