zhuzhurk commented on code in PR #24855:
URL: https://github.com/apache/flink/pull/24855#discussion_r1617628663
########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be restarted. For streaming jobs, it can resume
+  from the last successful checkpoint. However, for batch jobs, which do not have checkpoints, it will have to start

Review Comment:
   For streaming jobs, it can ... However, for batch jobs, which do not have checkpoints, it will ... ->
   Streaming jobs can resume from latest successful checkpoints. Batch jobs, however, do not have checkpoints and have to start over from the beginning, losing all previously made progress.

########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
+  over from the beginning, losing all previously made progress. This represents a significant regression for
+  long-running batch jobs.
+
+To address this issue, we introduced a feature for the recovery of batch jobs after a JobMaster failover in Flink
+version 1.20. The main purpose of this feature is to enable batch jobs to recover as much progress as possible after a JobMaster
+failover, avoiding the need to rerun tasks that have already been finished.
+
+To implement this feature, we introduced a JobEventStore component. This component allows Flink to record state change

Review Comment:
   we introduced a JobEventStore component. This component allows Flink record ->
   a JobEventStore component is introduced to record

########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
+events of the JobMaster (such as ExecutionGraph, OperatorCoordinator, etc.) to an external filesystem. During the crash
+and subsequent restart of the JobMaster, TaskManagers will retain the intermediate result
+data produced by the job and attempt to reconnect continuously. Once the JobMaster restarts, it will re-establish
+connections with TaskManagers and recover the job state based on the retained
+intermediate results and the events previously recorded in the JobEventStore, thereby resuming the job's execution
+progress.
+
+## Usage
+
+This section explains how to enable recovery of batch jobs from JobMaster failures and how to use
+configuration options to optimize the recovery process.
+
+### How to enable batch job Recovery from job master failures
+
+- Enable cluster high availability:
+
+  To enable the recovery of batch jobs from JobMaster failures, it is essential to first ensure that cluster
+  high availability (HA) is enabled. Flink supports HA services compatible with both ZooKeeper and Kubernetes.
+  For detailed configuration steps, please consult the section on [cluster high availability]({{< ref "docs/deployment/ha/overview#how-to-make-a-cluster-highly-available" >}})
+  in the Flink official documentation.
+- Configure [execution.batch.job-recovery.enabled]({{< ref "docs/deployment/config" >}}#execution-batch-job-recovery-enabled): true
+
+Note that currently only [Adaptive Batch Scheduler]({{< ref "docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler)
+supports this feature. And Flink batch jobs will use this scheduler by default unless another scheduler is explicitly configured.
+
+### Configuration Optimization
+
+To enable batch jobs to recover as much progress as possible after a JobMaster failover, and avoid rerunning tasks
+that have already been finished, you can configure the following options for optimization:
+
+- [execution.batch.job-recovery.snapshot.min-pause]({{< ref "docs/deployment/config" >}}#execution-batch-job-recovery-snapshot-min-pause):
+  This setting determines the minimum pause time allowed between snapshots for the OperatorCoordinator and ShuffleMaster.
+  This parameter could be adjusted based on the expected I/O load of your cluster and the tolerable amount of state regression.
+  Reduce this interval if smaller state regressions are preferred and a higher I/O load is acceptable.
+- [execution.batch.job-recovery.previous-worker.recovery.timeout]({{< ref "docs/deployment/config" >}}#execution-batch-job-recovery-previous-worker-recovery-timeout):
+  This setting determines the timeout duration allowed for Shuffle workers to reconnect. During the recovery process, Flink
+  requests the retained intermediate result data information from the Shuffle Master. If the timeout is reached,
+  Flink will use all the acquired intermediate result data to recover the state.
+- [job-event.store.write-buffer.flush-interval]({{< ref "docs/deployment/config" >}}#job-event-store-write-buffer-flush-interval):
+  This setting determines the flush interval for the JobEventStore's write buffers.
+- [job-event.store.write-buffer.size]({{< ref "docs/deployment/config" >}}#job-event-store-write-buffer-size): This
+  setting determines the write buffer size in the JobEventStore. When the buffer is full, its contents are flushed to the external
+  filesystem.
+
+## Limitations
+
+- Only compatible with the new Source (FLIP-27): Since the legacy Source has been deprecated, this feature is only
+  compatible with the new Source.
+- Exclusive to the [Adaptive Batch Scheduler]({{< ref "docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler):
+  Currently, only the Adaptive Batch Scheduler supports the recovery of batch jobs after a
+  JobMaster failover. As a result, the feature inherits all the
+  [limitations of the Adaptive Batch Scheduler]({{<ref "docs/deployment/elastic_scaling" >}}#limitations-2).

Review Comment:
   IIRC, `SupportsBatchSnapshot` is also important for this feature to work better.
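Taken together, the "How to enable" and "Configuration Optimization" sections quoted above amount to a handful of configuration entries. A minimal sketch in Flink config-file YAML follows; the keys are quoted verbatim from the document, while the values are purely illustrative, not recommended defaults:

```yaml
# Minimal sketch: enabling batch job recovery.
# Keys are quoted from the document under review; values are illustrative.
execution.batch.job-recovery.enabled: true

# Optional tuning from the "Configuration Optimization" section:
execution.batch.job-recovery.snapshot.min-pause: 3 min   # min pause between OperatorCoordinator/ShuffleMaster snapshots
execution.batch.job-recovery.previous-worker.recovery.timeout: 30 s   # how long shuffle workers may take to reconnect
job-event.store.write-buffer.flush-interval: 1 s   # JobEventStore write-buffer flush interval
job-event.store.write-buffer.size: 1 mb   # buffer is flushed to the external filesystem when full
```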
########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
+- Enable cluster high availability:
+
+  To enable the recovery of batch jobs from JobMaster failures, it is essential to first ensure that cluster
+  high availability (HA) is enabled. Flink supports HA services compatible with both ZooKeeper and Kubernetes.
+  For detailed configuration steps, please consult the section on [cluster high availability]({{< ref "docs/deployment/ha/overview#how-to-make-a-cluster-highly-available" >}})

Review Comment:
   For detailed configuration steps, please consult the section on cluster high availability in the Flink official documentation. ->
   More details of the configuration can be found in the [High Availability] page.

########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
+To address this issue, we introduced a feature for the recovery of batch jobs after a JobMaster failover in Flink
+version 1.20. The main purpose of this feature is to enable batch jobs to recover as much progress as possible after a JobMaster

Review Comment:
   To address this issue, we introduced a feature for the recovery of batch jobs after a JobMaster failover in Flink version 1.20. The main purpose of this feature is to enable ... ->
   To address this issue, a batch job recovery mechanism is introduced to enable ...

########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two situations will occur:

Review Comment:
   Currently -> Previously. This will be the feature document of Flink 1.20.

########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
+  To enable the recovery of batch jobs from JobMaster failures, it is essential to first ensure that cluster
+  high availability (HA) is enabled. Flink supports HA services compatible with both ZooKeeper and Kubernetes.

Review Comment:
   Flink supports HA services compatible with both ZooKeeper and Kubernetes ->
   Flink supports HA services backed by ZooKeeper or Kubernetes.
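For readers following along, a sketch of what the HA prerequisite might look like in the configuration file. These high-availability keys come from Flink's High Availability documentation rather than from this PR, and the endpoint and path values are placeholders:

```yaml
# Sketch: HA service backed by ZooKeeper (placeholder values, adjust for your cluster).
high-availability.type: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/

# Or, HA service backed by Kubernetes:
# high-availability.type: kubernetes
```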
########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
+  [limitations of the Adaptive Batch Scheduler]({{<ref "docs/deployment/elastic_scaling" >}}#limitations-2).
+- Remote Shuffle Service not supported.

Review Comment:
   -> Not working when using remote shuffle services.

########## docs/content/docs/ops/batch/recovery_from_job_master_failure.md: ##########
+## Limitations
+
+- Only compatible with the new Source (FLIP-27): Since the legacy Source has been deprecated, this feature is only

Review Comment:
   compatible with -> working with

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
