[
https://issues.apache.org/jira/browse/FLINK-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381476#comment-16381476
]
Sihua Zhou commented on FLINK-8753:
-----------------------------------
Sorry for the interruption, but after having a look at the code of
{{JobMaster#rescaleOperators}}, which is used to support online rescaling,
I find the distinction between {{checkpoint & savepoint}} a bit confusing now.
{{JobMaster#rescaleOperators}} triggers a savepoint called
{{lastInternalSavepoint}}; its name makes me feel that it is not the kind of
savepoint that Aljoscha mentioned above (which aims to be unified between
backends eventually). The {{lastInternalSavepoint}} is a savepoint meant only
for rescaling the job, which is the same thing this JIRA wants (but performance
is a problem because it also goes through the full checkpoint path). So can I
assume that what Flink wants for {{checkpoint & savepoint}} is 3 different things:
- checkpoint, which doesn't support rescaling and is only used for recovery
from failure; the best performance.
- internalSavepoint, which supports rescaling but is not unified between
backends; high performance, but lower than checkpoint (maybe like the
{{archive checkpoint}} that Stephan mentioned above).
- savepoint, which supports rescaling and is unified between backends;
performance lower than {{internalSavepoint}}.
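To make the question concrete, here is a minimal sketch of the three kinds as I currently understand them. The class, enum, and field names are hypothetical and are not Flink API; the cost rank only encodes the relative ordering guessed above, not any measurement.

```java
// Hypothetical model of the three snapshot kinds discussed in this comment.
// None of these names exist in Flink; they only restate the list above.
public class SnapshotKinds {

    enum SnapshotKind {
        // (supportsRescaling, unifiedAcrossBackends, relativeCostRank)
        CHECKPOINT(false, false, 1),
        INTERNAL_SAVEPOINT(true, false, 2),
        SAVEPOINT(true, true, 3);

        final boolean supportsRescaling;
        final boolean unifiedAcrossBackends;
        // 1 = assumed cheapest to take, 3 = assumed most expensive.
        final int relativeCostRank;

        SnapshotKind(boolean supportsRescaling,
                     boolean unifiedAcrossBackends,
                     int relativeCostRank) {
            this.supportsRescaling = supportsRescaling;
            this.unifiedAcrossBackends = unifiedAcrossBackends;
            this.relativeCostRank = relativeCostRank;
        }
    }

    public static void main(String[] args) {
        // Print one line per kind so the assumed trade-offs can be compared.
        for (SnapshotKind kind : SnapshotKind.values()) {
            System.out.printf("%s: rescaling=%b, unified=%b, costRank=%d%n",
                    kind, kind.supportsRescaling,
                    kind.unifiedAcrossBackends, kind.relativeCostRank);
        }
    }
}
```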
Sorry for the interruption again, but could you help me understand these?
[~aljoscha][~StephanEwen]
> Introduce savepoint that go though the incremental checkpoint path
> ------------------------------------------------------------------
>
> Key: FLINK-8753
> URL: https://issues.apache.org/jira/browse/FLINK-8753
> Project: Flink
> Issue Type: New Feature
> Components: State Backends, Checkpointing
> Affects Versions: 1.5.0
> Reporter: Sihua Zhou
> Assignee: Sihua Zhou
> Priority: Major
>
> Right now, savepoints go through the full checkpoint path, so taking a
> savepoint can be slow. In our production environment, for some long-running
> jobs it often takes more than 10 minutes to complete a savepoint, which is
> unacceptable for a real-time job, so we currently have to fall back to
> externalized checkpoints instead. But with externalized checkpoints there is
> a time gap (up to the checkpoint interval) since the last checkpoint. So I
> propose introducing an incremental savepoint that goes through the
> incremental checkpoint path.
> Any advice would be appreciated!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)