Currently, savepoints are simply completed checkpoints, and Flink
provides commands (save/run) for saving and restoring jobs. In the near
future, however, savepoints will differ substantially from checkpoints:
they will use a common serialization format and allow recovery from major
updates. As a result, saving and restoring based on savepoints will become
more costly.
To provide efficient saving and restoring of jobs, we propose adding two
more commands to Flink, SUSPEND and RESUME, both based on checkpoints.
Because the implementation of checkpoints depends on the backends (and many
other components in Flink), suspending and resuming may not work across
major changes to the job or to Flink (e.g., different backends). But
because they are built on checkpoints instead of savepoints, they are
expected to be more efficient.
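To make the compatibility caveat concrete, here is a minimal sketch of the
kind of check RESUME might perform before restoring from a checkpoint. It is
an illustration only: ResumeCheck, CheckpointInfo, and canResume are
hypothetical names, not existing Flink APIs, and requiring an identical
backend and Flink version is a deliberately conservative assumption rather
than part of the proposed design.

```java
import java.util.Objects;

// Hypothetical illustration only: none of these types exist in Flink.
public class ResumeCheck {

    /** Assumed description of the environment that wrote a checkpoint. */
    static final class CheckpointInfo {
        final String backend;       // illustrative value, e.g. "rocksdb"
        final String flinkVersion;  // illustrative value, e.g. "1.1.0"

        CheckpointInfo(String backend, String flinkVersion) {
            this.backend = Objects.requireNonNull(backend);
            this.flinkVersion = Objects.requireNonNull(flinkVersion);
        }
    }

    /**
     * Returns true only when the resuming environment uses the same backend
     * and the same Flink version as the one that wrote the checkpoint.
     * Assumption: any mismatch counts as a "major change" and blocks RESUME.
     */
    static boolean canResume(CheckpointInfo suspended,
                             String currentBackend,
                             String currentVersion) {
        return suspended.backend.equals(currentBackend)
            && suspended.flinkVersion.equals(currentVersion);
    }

    public static void main(String[] args) {
        CheckpointInfo cp = new CheckpointInfo("rocksdb", "1.1.0");
        // Same backend and version: resuming is allowed.
        System.out.println(canResume(cp, "rocksdb", "1.1.0"));
        // Different backend: resuming is refused.
        System.out.println(canResume(cp, "filesystem", "1.1.0"));
    }
}
```

When such a check fails, the user would fall back to a savepoint, which is
meant to survive this kind of change at a higher cost.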
The details of the design can be viewed in the Google Doc: Support Resuming
and Suspending of Flink Jobs
Looking forward to your comments. Any feedback is appreciated. :)