[DISCUSS] Support Suspending and Resuming of Flink Jobs

SHI Xiaogang Wed, 12 Oct 2016 02:53:02 -0700

Hi all,

Currently, savepoints are simply completed checkpoints, and Flink provides
commands (save/run) for saving and restoring jobs. In the near future,
however, savepoints will differ substantially from checkpoints because they
will use common serialization formats and allow recovery across major
updates. Saving and restoring based on savepoints will then be more costly.


To provide efficient saving and restoring of jobs, we propose adding two
more commands to Flink, SUSPEND and RESUME, both of which are based on
checkpoints.

As the implementation of checkpoints depends on the backends (and many
other components in Flink), suspending and resuming may not work if there
are major changes in the job or in Flink (e.g., different backends). But
because the implementation is based on checkpoints instead of savepoints,
these commands should be more efficient.

The details of the design can be found in the Google Doc "Support Resuming
and Suspending of Flink Jobs":
<https://docs.google.com/document/d/1c3vUOTrNlCu2uhfi5ZNYpAguoFR03NgQWZpDTkSxVjg/edit?usp=sharing>

Looking forward to your comments. Any feedback is appreciated. :)

Thanks,
Xiaogang
