[ https://issues.apache.org/jira/browse/FLINK-26683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510537#comment-17510537 ]
Piotr Nowojski commented on FLINK-26683:
----------------------------------------
{quote}
We suggest changing it so that when a failure occurs while committing side
effects for stop-with-savepoint, we restore the state only to commit the
side effects. We do not start a proper job and do not consume records. Such
information would need to be stored alongside the savepoint in the case of
stop-with-savepoint --drain, whereas it could be kept only transiently in the
CompletedCheckpointStore in the case of stop-with-savepoint --no-drain. The
difference results from the fact that --no-drain can be, and usually is, used
for restarting regular processing. That is not the case for --drain, which
might only ever be restored for the purpose of committing side effects.
{quote}
Just to elaborate a bit on the difference between persistent and transient
storage of the information about whether this savepoint was taken by
stop-with-savepoint with or without drain.
In either case, we would like Flink to keep restarting the job and retrying to
commit side effects for as long as possible/configured (number of restarts).
However, such a cycle of restarts can be interrupted in many ways, leaving the
user with a "consistent savepoint that has potentially uncommitted
side effects". The user would then have to manually restart the job from such
a savepoint. Now, in the "with drain" case, for the sake of consistency, we
cannot resume processing new records unless the user explicitly overrides this
behaviour. So we need to persist the information that this was a
"stop-with-savepoint with drain". On the other hand, in the "without drain"
case, a manual restart from such a savepoint could work either way. It could
restart in "stop-with-savepoint without drain" mode, just to finish committing
side effects, or it could be interpreted as a normal restart from a savepoint.
Both options are semantically correct; the question is which makes more
sense/is more consistent.
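To make the manual-restart cases above concrete, here is a minimal sketch of
the decision, not actual Flink code. All names (StopMode, RestoreAction,
decide) are hypothetical, and for the "without drain" case it assumes the
"normal restart" interpretation, which is only one of the two options
discussed.

```java
import java.util.Optional;

public class RestoreDecisionSketch {

    /** Hypothetical marker: how the job was stopped when the savepoint was taken. */
    public enum StopMode { WITH_DRAIN, WITHOUT_DRAIN }

    /** Hypothetical outcome of a manual restart from the savepoint. */
    public enum RestoreAction { COMMIT_SIDE_EFFECTS_ONLY, RESUME_PROCESSING }

    /**
     * @param persistedMarker marker stored alongside the savepoint; empty when the
     *     information was only transient ("without drain") and has been lost
     * @param userOverride true if the user explicitly asks to process records anyway
     */
    public static RestoreAction decide(Optional<StopMode> persistedMarker,
                                       boolean userOverride) {
        boolean drained =
                persistedMarker.filter(m -> m == StopMode.WITH_DRAIN).isPresent();
        if (drained && !userOverride) {
            // Drained savepoint: restore only to finish committing side effects.
            return RestoreAction.COMMIT_SIDE_EFFECTS_ONLY;
        }
        // Assumption: without a persisted drain marker (or with an explicit
        // override), the restart is treated as a normal restart from a savepoint.
        return RestoreAction.RESUME_PROCESSING;
    }

    public static void main(String[] args) {
        System.out.println(decide(Optional.of(StopMode.WITH_DRAIN), false));
        System.out.println(decide(Optional.of(StopMode.WITH_DRAIN), true));
        System.out.println(decide(Optional.empty(), false));
    }
}
```

If we preferred the other interpretation for "without drain", the marker would
have to be persisted in that case too, and the last branch would need to
distinguish WITHOUT_DRAIN from a regular savepoint.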
> Terminate the job anyway if savepoint finished when stop-with-savepoint
> -----------------------------------------------------------------------
>
> Key: FLINK-26683
> URL: https://issues.apache.org/jira/browse/FLINK-26683
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.15.0, 1.14.4
> Reporter: Liu
> Priority: Major
> Fix For: 1.16.0
>
>
> When we stop with savepoint, the savepoint finishes, but some tasks fail over
> for some reason and restart into the running state. In the end, some tasks
> are finished and some tasks are running. In this case, I think that we should
> terminate all the tasks anyway instead of restarting them, since the savepoint
> is finished and the job has stopped consuming data. What do you think?