gaoyunhaii commented on pull request #16655: URL: https://github.com/apache/flink/pull/16655#issuecomment-895257415
Hi @StephanEwen, very thanks for the suggestions! I'm currently fixing the code, and for the remaining issues: > For the rules of when to allow the changes, I understand the issue you raised with not having the other JobGraph to > compare to. I think we should avoid a way that needs access to the previous job graph. > > - Can we capture this in rules similar to what you have now, but simpler rules? > - Or can we solve this by passing another flag into the restoreCheckpoint() method, similar to the allowUnusedState flag. A flag like allowPartiallyFinishedTasks, which is only set to false if a new job is submitted? - I also agree now we are checking the specific cases for the more general rules, but it currently seems to me we could not further simple the rule since we might have to consider the cartesian product of `[partly finished, fully finished]` * `[pointwise, all_to_all]`. - I think we indeed could add a flag to only do the check on the initial restore (`restoreInitialCheckpointIfPresent` and `restoreSavepoint`. We could always benefit from the flag for failover if we actually has the check. But on the other side, we might still need to consider the same issue for the initial restoring ? > I would need to think a bit more here, so maybe for the time being, we could either > > - keep the rules you have for now, but try to document it already in the suggested way (no partially finished tasks). As I understand it, your rules are a more special case of that. > - not have any complex rules there at all: we allow all modifications, but caution users that upgrades with finished tasks are currently not well-defined. That could be an option if we see that the evaluation of the current rules is either fragile or has a too high overhead with high parallelism jobs. > > That would be my final suggestion here also: We need to test that the rules whether restore is allowed scale well to Execution Graphs with > 10000 parallelism and around 100 operators. They should not take longer than few milliseconds to execute. I agree with that we could only promise to the users that we do not support topology change with finished tasks. I'll first run the tests regarding the performance so that we could make the decision based on the tests results~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
