Hi, Dongwoo.

IIUC, you mean using a savepoint to store a snapshot in other storage if checkpoints fail multiple times due to some long-lasting exception in the external storage, right?

I think it's better to achieve this with an external tool instead of introducing a config like that:
1. It is sometimes not easy to judge whether an exception is caused by the external storage or not, and it's not so reasonable to trigger a savepoint just because checkpoints have failed multiple times.
2. It's better to leave the logic around triggering savepoints, e.g. periodic savepoints or triggering stop-with-savepoint, to external tools or platforms. As you can see from [1], we intend to keep their scopes clear.

Maybe you could periodically check the checkpoint status and failure message via [2] from your external tool or platform, and then trigger a savepoint or stop-with-savepoint through the REST API or CLI.
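As a rough sketch (not a definitive recipe) of what such an external check could look like, something like the script below could run as a cron job or small daemon. It assumes the response layout documented in [2] and uses POST /jobs/:jobid/stop to trigger stop-with-savepoint; the host, job id, threshold and savepoint directory are placeholders you would adapt.

import time
import requests

FLINK_REST = "http://localhost:8081"        # JobManager REST address (placeholder)
JOB_ID = "<your-job-id>"                    # placeholder
SAVEPOINT_DIR = "hdfs:///flink/savepoints"  # fallback storage, e.g. HDFS (placeholder)
MAX_FAILED = 5                              # the tool's own tolerance, independent of Flink's config
POLL_INTERVAL_S = 60

def failed_checkpoint_count() -> int:
    # GET /jobs/:jobid/checkpoints ([2]) exposes checkpoint counts and the latest failure.
    stats = requests.get(f"{FLINK_REST}/jobs/{JOB_ID}/checkpoints", timeout=10).json()
    latest_failed = stats.get("latest", {}).get("failed") or {}
    if latest_failed.get("failure_message"):
        print("last checkpoint failure:", latest_failed["failure_message"])
    # A real tool would rather look at consecutive failures in "history";
    # the cumulative count is enough for this sketch.
    return stats["counts"]["failed"]

def stop_with_savepoint() -> str:
    # POST /jobs/:jobid/stop triggers stop-with-savepoint and returns a trigger id.
    resp = requests.post(
        f"{FLINK_REST}/jobs/{JOB_ID}/stop",
        json={"targetDirectory": SAVEPOINT_DIR, "drain": False},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["request-id"]

while True:
    if failed_checkpoint_count() >= MAX_FAILED:
        print("stop-with-savepoint triggered, request id:", stop_with_savepoint())
        break
    time.sleep(POLL_INTERVAL_S)

The returned trigger id can then be polled via GET /jobs/:jobid/savepoints/:triggerid to find the final savepoint path, and the job can be resubmitted with a different checkpoint storage configured.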
[1] https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/checkpoints_vs_savepoints/
[2] https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/rest_api/#jobs-jobid-checkpoints

On Wed, Sep 6, 2023 at 11:05 AM Yanfei Lei <fredia...@gmail.com> wrote:
> Hi Dongwoo,
>
> If the checkpoint has failed
> `execution.checkpointing.tolerable-failed-checkpoints` times, then
> stopWithSavepoint is likely to fail as well.
> If stopWithSavepoint succeeds or fails, will the job just stop? I am
> more curious about how this option works with the restart strategy?
>
> Best,
> Yanfei
>
> On Mon, Sep 4, 2023 at 22:17, Dongwoo Kim <dongwoo7....@gmail.com> wrote:
> >
> > Hi all,
> > I have a proposal that aims to enhance the Flink application's resilience in cases of unexpected failures in checkpoint storage like S3 or HDFS.
> >
> > [Background]
> > When using self-managed S3-compatible object storage, we faced checkpoint async failures lasting for an extended period of more than 30 minutes,
> > leading to multiple job restarts and causing lags in our streaming application.
> >
> > [Current Behavior]
> > Currently, when the number of checkpoint failures exceeds a predefined tolerable limit, Flink will either restart or fail the job based on how it's configured.
> > In my opinion this does not handle scenarios where the checkpoint storage itself may be unreliable or experiencing downtime.
> >
> > [Proposed Feature]
> > I propose a config that allows for a graceful job stop with a savepoint when the tolerable checkpoint failure limit is reached.
> > Instead of restarting/failing the job when tolerable checkpoint failures are exceeded, when this new config is set to true we would just trigger stopWithSavepoint.
> >
> > This could offer the following benefits.
> > - Indication of Checkpoint Storage State: Exceeding tolerable checkpoint failures could indicate unstable checkpoint storage.
> > - Automated Fallback Strategy: When combined with a monitoring cron job, this feature could act as an automated fallback strategy for handling unstable checkpoint storage.
> > The job would stop safely, take a savepoint, and then you could automatically restart with different checkpoint storage configured, like switching from S3 to HDFS.
> >
> > For example, let's say the checkpoint path is configured to S3 and the savepoint path is configured to HDFS.
> > When the new config is set to true, the job stops with a savepoint like below when tolerable checkpoint failures are exceeded.
> > And we can restart the job from that savepoint with the checkpoint storage configured as HDFS.
> >
> > Looking forward to hearing the community's thoughts on this proposal.
> > And I also want to ask how the community is handling long-lasting unstable checkpoint storage issues.
> >
> > Thanks in advance.
> >
> > Best dongwoo,

--
Best,
Hangxiang.