[ 
https://issues.apache.org/jira/browse/FLINK-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381476#comment-16381476
 ] 

Sihua Zhou commented on FLINK-8753:
-----------------------------------

Sorry for the interruption, but after having a look at the code of 
{{JobMaster#rescaleOperators}}, which is used to support online rescaling, 
I find the distinction between {{checkpoint & savepoint}} a bit confusing now. 
{{JobMaster#rescaleOperators}} triggers a savepoint called 
{{lastInternalSavepoint}}. Its name makes me feel that it is not the kind of 
savepoint that Aljoscha mentioned above (which aims to eventually be unified 
across backends). The {{lastInternalSavepoint}} is a savepoint whose only purpose 
is to rescale the job, which is the same thing this JIRA wants (but performance 
is a problem because it also goes through the full checkpoint path). So can I 
assume that what Flink wants for {{checkpoint & savepoint}} are 3 different things:

- checkpoint, which doesn't support rescaling and is only used for recovery from 
failure, with the best performance.
- internalSavepoint, which supports rescaling but is not unified across 
backends, with high performance, though lower than checkpoint (maybe like the 
{{archive checkpoint}} that Stephan mentioned above).
- savepoint, which supports rescaling and is unified across backends, with 
lower performance than {{internalSavepoint}}.
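
To make the question concrete, here is a rough sketch of how I picture the three kinds. The enum {{SnapshotKind}} and its field names are hypothetical and only for illustration; they are not existing Flink classes, and the properties are just my current understanding from the list above.

```java
// Hypothetical illustration only: none of these names exist in Flink.
// Constants are declared from (assumed) fastest to slowest.
enum SnapshotKind {
    CHECKPOINT(false, false),         // failure recovery only
    INTERNAL_SAVEPOINT(true, false),  // rescaling, backend-specific format
    SAVEPOINT(true, true);            // rescaling, format unified across backends

    final boolean supportsRescaling;
    final boolean unifiedAcrossBackends;

    SnapshotKind(boolean supportsRescaling, boolean unifiedAcrossBackends) {
        this.supportsRescaling = supportsRescaling;
        this.unifiedAcrossBackends = unifiedAcrossBackends;
    }
}
```

If this picture is right, what this JIRA asks for would correspond to the middle entry.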

Sorry for the interruption again, but could you help me understand this? 
[~aljoscha] [~StephanEwen]

> Introduce savepoint that go though the incremental checkpoint path
> ------------------------------------------------------------------
>
>                 Key: FLINK-8753
>                 URL: https://issues.apache.org/jira/browse/FLINK-8753
>             Project: Flink
>          Issue Type: New Feature
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.5.0
>            Reporter: Sihua Zhou
>            Assignee: Sihua Zhou
>            Priority: Major
>
> Right now, a savepoint goes through the full checkpoint path, so taking a 
> savepoint can be slow. In our production, for some long-running jobs it often 
> takes more than 10 min to complete a savepoint, which is unacceptable for a 
> real-time job, so we currently have to fall back to using externalized 
> checkpoints instead. But with externalized checkpoints there is a time gap 
> (up to the checkpoint interval) since the last checkpoint. So I propose to 
> introduce an incremental savepoint that goes through the incremental 
> checkpoint path.
> Any advice would be appreciated!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
