StephanEwen edited a comment on pull request #16655:
URL: https://github.com/apache/flink/pull/16655#issuecomment-894305960
Thanks for the discussion and digging out the corner cases to check.
I was wondering if a very simple rule could actually be good enough here:
*You can only change the JobGraph operator chains from a checkpoint if it
has no partially finished tasks.*
That would be simple to implement and simple to explain to users, and it
would prevent all the corner cases we talked about here. The question is: Is it
too restrictive? Will it make some common case impossible or impractical?
_What use cases do we need to support for changing the JobGraph with
finished tasks?_
1. Rescaling due to lack of resources and in reactive mode.
This doesn't change the JobGraph structure, but only the parallelism of
operators (including possibly operators that are partially finished).
@gaoyunhaii is it guaranteed that this still works?
2. Upgrading an application after "stop-with-savepoint" (maybe
"stop-with-checkpoint" in the future).
This should be fine, because the final checkpoint should have all
tasks finished.
@pnowojski Can we double check this is also guaranteed with Unaligned
Checkpoints?
3. For applications joining some bounded inputs and some unbounded inputs,
the application often runs the majority of the time after the bounded input
has finished. Such applications should still be modifiable then.
This should also be fine because in that case the bounded source and
the bounded stream sub-flow have only a small time window in which they are
partially finished. After that, they are permanently finished.
_What use cases would not be supported?_
- Upgrades in the middle of a stopping procedure, which has partially
completed.
- Upgrades in the small window while a bounded join input has partially
completed (see point (3) above).
Maybe I am overlooking some important use cases here. But if the above
assessment of use cases is correct, this restriction could simplify our code
a lot, make the behavior easier for users to understand, and also avoid some
unexpected surprises for users.
Yun Gao mentioned an interesting case with `reinterpretAsKeyedStream()`.
Cases like this, where users make some non-standard assumptions, will always
exist and confuse users. It would be more explicit for users if they could not
restore their checkpoint with a changed JobGraph (re-scaling would still work,
though) but would get an explicit error and need to bring the job into a stable
state again.
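To make the proposed rule a bit more concrete, here is a rough sketch of what the restore-time check could look like. This is not Flink code: `RestoreRuleSketch`, `OperatorStatus`, and `checkRestoreAllowed` are hypothetical names invented for illustration, and the sketch assumes that "partially finished" means some, but not all, subtasks of an operator are finished in the checkpoint.

```java
import java.util.Map;

/** Hypothetical sketch of the proposed rule; none of these names are Flink APIs. */
final class RestoreRuleSketch {

    /** Assumed per-operator summary recorded in a checkpoint. */
    static final class OperatorStatus {
        final int finishedSubtasks;
        final int parallelism;

        OperatorStatus(int finishedSubtasks, int parallelism) {
            this.finishedSubtasks = finishedSubtasks;
            this.parallelism = parallelism;
        }

        /** Assumption: partially finished = some, but not all, subtasks finished. */
        boolean isPartiallyFinished() {
            return finishedSubtasks > 0 && finishedSubtasks < parallelism;
        }
    }

    /**
     * Rejects a restore that changes the operator chains while the checkpoint
     * contains a partially finished operator. Pure rescaling is always allowed.
     */
    static void checkRestoreAllowed(
            Map<String, OperatorStatus> checkpointedOperators,
            boolean operatorChainsChanged) {
        if (!operatorChainsChanged) {
            return; // only the parallelism changed, the rule does not apply
        }
        for (Map.Entry<String, OperatorStatus> entry : checkpointedOperators.entrySet()) {
            if (entry.getValue().isPartiallyFinished()) {
                throw new IllegalStateException(
                        "Cannot restore with changed operator chains: operator '"
                                + entry.getKey()
                                + "' is partially finished. Bring the job into a"
                                + " stable state and take a new checkpoint first.");
            }
        }
    }
}
```

With a check of this shape, a checkpoint where every operator is either fully running or fully finished restores with a changed JobGraph as before, while the narrow "partially finished" window produces an explicit error instead of a surprising restore.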