StephanEwen edited a comment on pull request #16655:
URL: https://github.com/apache/flink/pull/16655#issuecomment-894305960


   Thanks for the discussion and digging out the corner cases to check. 
   I was wondering if a very simple rule could actually be good enough here:
   
   *You can only change the JobGraph operator chains when restoring from a 
checkpoint if that checkpoint has no partially finished tasks.*
   
   That would be simple to implement, simple to explain to users, and 
would prevent all the corner cases we talked about here. The question is: Is it 
too restrictive? Will it make some common case impossible or impractical?
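
   To illustrate how small such a check could be, here is a minimal sketch. It 
assumes that "partially finished" means some, but not all, subtasks of a vertex 
have finished. The class, the method names, and the map of per-subtask flags are 
hypothetical and not existing Flink APIs.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed rule; none of these names are Flink APIs. */
final class ChainChangeRule {

    /** Assumption: a vertex is partially finished if some, but not all, subtasks finished. */
    static boolean isPartiallyFinished(List<Boolean> subtaskFinished) {
        long finished = subtaskFinished.stream().filter(f -> f).count();
        return finished > 0 && finished < subtaskFinished.size();
    }

    /** Operator chains may change only if no vertex in the checkpoint is partially finished. */
    static boolean canChangeChains(Map<String, List<Boolean>> finishedFlagsByVertex) {
        return finishedFlagsByVertex.values().stream()
                .noneMatch(ChainChangeRule::isPartiallyFinished);
    }
}
```

   Under this reading, a checkpoint in which a bounded source is fully finished 
and everything else is still running would pass the check, while a checkpoint 
taken while only some source subtasks have finished would not.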
   
   _What use cases do we need to support for changing the JobGraph with 
finished tasks?_
    
     1. Rescaling due to lack of resources and in reactive mode.
     
        This doesn't change the JobGraph structure, but only the parallelism of 
operators (including possibly operators that are partially finished). 
@gaoyunhaii is it guaranteed that this still works?
     
     2. Upgrading an application after "stop-with-savepoint" (maybe 
"stop-with-checkpoint" in the future).
     
         This should be fine, because the final checkpoint should have all 
tasks finished.
         @pnowojski Can we double check this is also guaranteed with Unaligned 
Checkpoints?
     
     3. Applications that join some bounded inputs and some unbounded inputs 
often run the majority of the time after the bounded input has finished. They 
should still be modifiable then.
     
         This should also be fine because in that case the bounded source and 
the bounded stream sub-flow have only a small time window in which they are 
partially finished. After that, they are permanently finished.
   
   _What use cases would not be supported?_
   
     - Upgrades in the middle of a stopping procedure, which has partially 
completed.
     - Upgrades in the small window while a bounded join input has partially 
completed (see point (3) above).
   
   Maybe I am overlooking some important use cases here. But if the above 
assessment of use cases is correct, this simplification could simplify our code 
a lot, make the behavior easier for users to understand, and avoid some 
unexpected surprises for users.
   
   Yun Gao mentioned an interesting case with `reinterpretAsKeyedStream()`. 
Such cases, where users make some non-standard assumptions, always exist and 
confuse the users. It would be more explicit for users if they could not 
restore their checkpoint with a changed JobGraph (re-scaling would still work, 
though) but would instead get an explicit error and need to bring the job into 
a stable state again.
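
   As a rough sketch of what that explicit error could look like on restore: 
the `RestoreCheck` class, the `Change` enum, and the message text below are all 
hypothetical, and the sketch assumes the restore path already knows what kind of 
change was made and whether the checkpoint contains partially finished tasks.

```java
/** Hypothetical sketch of the restore-time check; these types are not Flink APIs. */
final class RestoreCheck {

    /** Assumed classification of how the submitted JobGraph differs from the checkpointed one. */
    enum Change { NONE, PARALLELISM_ONLY, OPERATOR_CHAINS }

    /** Fails fast instead of restoring into a state with surprising semantics. */
    static void validate(Change change, boolean checkpointHasPartiallyFinishedTasks) {
        // Re-scaling is deliberately let through; only chain changes are rejected.
        if (change == Change.OPERATOR_CHAINS && checkpointHasPartiallyFinishedTasks) {
            throw new IllegalStateException(
                    "Cannot restore with changed operator chains: the checkpoint contains "
                            + "partially finished tasks. Restore with the original JobGraph "
                            + "and change it once the job is in a stable state again.");
        }
    }
}
```

   The point of failing with an exception is that the user sees right away why 
the restore was rejected and what to do next, rather than hitting one of the 
corner cases later at runtime.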

