[ 
https://issues.apache.org/jira/browse/FLINK-26683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510537#comment-17510537
 ] 

Piotr Nowojski commented on FLINK-26683:
----------------------------------------

{quote}
We suggest to change it so that when a failure appears during committing side 
effects for stop-with-savepoint we restore the state only to commit 
side-effects. We do not start a proper job and do not consume records. Such 
information would need to be stored alongside the savepoint in case of 
stop-with-savepoint --drain, whereas it could be transient only in the 
CompletedCheckpointStore in case of stop-with-savepoint --no-drain. The 
difference results from the fact that --no-drain can be, and usually is, used 
for restarting regular processing. That is not the case for drain, which might 
only ever be restored purely for the purpose of committing side-effects.
{quote}
Let me elaborate a bit on why this information would be stored persistently or 
only transiently, depending on whether the savepoint was taken by 
stop-with-savepoint with or without drain.

In either case, we would like Flink to keep restarting the job and retrying to 
commit side effects for as long as possible, or for as long as configured 
(number of restarts). However, such a cycle of restarts can be interrupted in 
many ways, leaving the user with a "consistent savepoint that has potentially 
uncommitted side-effects". The user would then have to manually restart the 
job from such a savepoint. Now, in the "with drain" case, for the sake of 
consistency, we cannot resume processing new records unless the user 
explicitly overrides this behaviour. So we need to persist the information 
that this was a "stop-with-savepoint with drain". On the other hand, in the 
"without drain" case, a manual restart from such a savepoint could work either 
way. It could restart in "stop-with-savepoint without drain" mode, just to 
finish committing side effects, or it could be interpreted as a normal restart 
from a savepoint. Both options are semantically correct; the question is which 
makes more sense or is more consistent.
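
To make the cases above concrete, here is a minimal sketch of the decision in 
Java. All names in it (SavepointRestoreSketch, SavepointKind, RestoreMode, 
decide, and the two boolean parameters) are hypothetical and are not existing 
Flink APIs. For the manual "without drain" case it arbitrarily picks the 
"normal restart" interpretation, which is only one of the two options 
discussed above.

```java
/** Illustrative only: none of these types or methods exist in Flink. */
public class SavepointRestoreSketch {

    /** How the savepoint being restored was originally taken. */
    enum SavepointKind { REGULAR, STOP_WITHOUT_DRAIN, STOP_WITH_DRAIN }

    /** What the restored job is allowed to do. */
    enum RestoreMode { COMMIT_SIDE_EFFECTS_ONLY, RESUME_PROCESSING }

    /**
     * @param kind          how the savepoint was taken
     * @param manualRestart true if the user restarts the job by hand, false if
     *                      Flink retries automatically after a failed commit
     * @param userOverride  true if the user explicitly asks to process records
     */
    static RestoreMode decide(SavepointKind kind, boolean manualRestart, boolean userOverride) {
        switch (kind) {
            case STOP_WITH_DRAIN:
                // The drain flag is assumed to be persisted with the savepoint,
                // so even a manual restart only commits side effects, unless
                // the user explicitly overrides this.
                return (manualRestart && userOverride)
                        ? RestoreMode.RESUME_PROCESSING
                        : RestoreMode.COMMIT_SIDE_EFFECTS_ONLY;
            case STOP_WITHOUT_DRAIN:
                // The flag is assumed to be transient: automatic retries still
                // see it, a manual restart does not. Assumption: the manual
                // restart is then treated as a normal restore.
                return manualRestart
                        ? RestoreMode.RESUME_PROCESSING
                        : RestoreMode.COMMIT_SIDE_EFFECTS_ONLY;
            default:
                // A regular savepoint restores into normal processing.
                return RestoreMode.RESUME_PROCESSING;
        }
    }

    public static void main(String[] args) {
        // Print the outcome for each kind of savepoint on a manual restart
        // without an explicit override.
        for (SavepointKind kind : SavepointKind.values()) {
            System.out.println(kind + " -> " + decide(kind, true, false));
        }
    }
}
```

If the other interpretation were chosen for "without drain", the flag would 
have to be persisted in that case as well, and the two stop-with-savepoint 
branches would look the same.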

> Terminate the job anyway if savepoint finished when stop-with-savepoint
> -----------------------------------------------------------------------
>
>                 Key: FLINK-26683
>                 URL: https://issues.apache.org/jira/browse/FLINK-26683
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.4
>            Reporter: Liu
>            Priority: Major
>             Fix For: 1.16.0
>
>
> When we stop with savepoint, the savepoint finishes. But some tasks failover 
> for some reason and restart to running. In the end, some tasks are finished 
> and some tasks are running. In this case, I think that we should terminate 
> all the tasks anyway instead of restarting since the savepoint is finished 
> and the job stops consuming data. What do you think?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
