zhuzhurk commented on code in PR #24855:
URL: https://github.com/apache/flink/pull/24855#discussion_r1617628663


##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be 
restarted. For streaming jobs, it can resume 
+  from the last successful checkpoint. However, for batch jobs, which do not 
have checkpoints, it will have to start

Review Comment:
   For streaming jobs, it can ...  However, for batch jobs, which do not have 
checkpoints, it will ...
   
   -> Streaming jobs can resume from the latest successful checkpoint. Batch jobs, 
however, do not have checkpoints and have to start over from the beginning, 
losing all previously made progress.



##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be 
restarted. For streaming jobs, it can resume 
+  from the last successful checkpoint. However, for batch jobs, which do not 
have checkpoints, it will have to start
+  over from the beginning, losing all previously made progress. This 
represents a significant regression for
+  long-running batch jobs.
+
+To address this issue, we introduced a feature for the recovery of batch jobs 
after a JobMaster failover in Flink 
+version 1.20. The main purpose of this feature is to enable batch jobs to 
recover as much progress as possible after a JobMaster
+failover, avoiding the need to rerun tasks that have already been finished.
+
+To implement this feature, we introduced a JobEventStore component. This 
component allows Flink to record state change

Review Comment:
   we introduced a JobEventStore component. This component allows Flink to record
   
   -> a JobEventStore component is introduced to record



##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be 
restarted. For streaming jobs, it can resume 
+  from the last successful checkpoint. However, for batch jobs, which do not 
have checkpoints, it will have to start
+  over from the beginning, losing all previously made progress. This 
represents a significant regression for
+  long-running batch jobs.
+
+To address this issue, we introduced a feature for the recovery of batch jobs 
after a JobMaster failover in Flink 
+version 1.20. The main purpose of this feature is to enable batch jobs to 
recover as much progress as possible after a JobMaster
+failover, avoiding the need to rerun tasks that have already been finished.
+
+To implement this feature, we introduced a JobEventStore component. This 
component allows Flink to record state change
+events of the JobMaster (such as ExecutionGraph, OperatorCoordinator, etc.) to 
an external filesystem. During the crash
+and subsequent restart of the JobMaster, TaskManagers will retain the 
intermediate result
+data produced by the job and attempt to reconnect continuously. Once the 
JobMaster restarts, it will re-establish
+connections with TaskManagers and recover the job state based on the retained
+intermediate results and the events previously recorded in the JobEventStore, 
thereby resuming the job's execution
+progress.
+
+## Usage
+
+This section explains how to enable recovery of batch jobs from JobMaster 
failures and how to use
+configuration options to optimize the recovery process.
+
+### How to enable batch job Recovery from job master failures
+
+- Enable cluster high availability:
+
+  To enable the recovery of batch jobs from JobMaster failures, it is 
essential to first ensure that cluster
+  high availability (HA) is enabled. Flink supports HA services compatible 
with both ZooKeeper and Kubernetes.
+  For detailed configuration steps, please consult the section on [cluster 
high availability]({{< ref 
"docs/deployment/ha/overview#how-to-make-a-cluster-highly-available" >}})
+  in the Flink official documentation.
+- Configure [execution.batch.job-recovery.enabled]({{< ref 
"docs/deployment/config" >}}#execution-batch-job-recovery-enabled): true
+
+Note that currently only [Adaptive Batch Scheduler]({{< ref 
"docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler) 
+supports this feature. And Flink batch jobs will use this scheduler by default 
unless another scheduler is explicitly configured.
+
+### Configuration Optimization
+
+To enable batch jobs to recover as much progress as possible after a JobMaster 
failover, and avoid rerunning tasks 
+that have already been finished, you can configure the following options for 
optimization:
+
+- [execution.batch.job-recovery.snapshot.min-pause]({{< ref 
"docs/deployment/config" >}}#execution-batch-job-recovery-snapshot-min-pause):
+  This setting determines the minimum pause time allowed between snapshots for 
the OperatorCoordinator and ShuffleMaster.
+  This parameter could be adjusted based on the expected I/O load of your 
cluster and the tolerable amount of state regression. 
+  Reduce this interval if smaller state regressions are preferred and a higher 
I/O load is acceptable.
+- [execution.batch.job-recovery.previous-worker.recovery.timeout]({{< ref 
"docs/deployment/config" 
>}}#execution-batch-job-recovery-previous-worker-recovery-timeout):
+  This setting determines the timeout duration allowed for Shuffle workers to 
reconnect. During the recovery process, Flink 
+  requests the retained intermediate result data information from the Shuffle 
Master. If the timeout is reached, 
+  Flink will use all the acquired intermediate result data to recover the 
state.
+- [job-event.store.write-buffer.flush-interval]({{< ref 
"docs/deployment/config" >}}#job-event-store-write-buffer-flush-interval):
+  This setting determines the flush interval for the JobEventStore's write 
buffers.
+- [job-event.store.write-buffer.size]({{< ref "docs/deployment/config" 
>}}#job-event-store-write-buffer-size): This 
+  setting determines the write buffer size in the JobEventStore. When the 
buffer is full, its contents are flushed to the external
+  filesystem.
+
+## Limitations
+
+- Only compatible with the new Source (FLIP-27): Since the legacy Source has 
been deprecated, this feature is only 
+  compatible with the new Source.
+- Exclusive to the [Adaptive Batch Scheduler]({{< ref 
"docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler): 
+  Currently, only the Adaptive Batch Scheduler supports the recovery of batch 
jobs after a
+  JobMaster failover. As a result, the feature inherits all the
+  [limitations of the Adaptive Batch Scheduler]({{<ref 
"docs/deployment/elastic_scaling" >}}#limitations-2).

Review Comment:
   IIRC, `SupportsBatchSnapshot` is also important for this feature to work 
better.



##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be 
restarted. For streaming jobs, it can resume 
+  from the last successful checkpoint. However, for batch jobs, which do not 
have checkpoints, it will have to start
+  over from the beginning, losing all previously made progress. This 
represents a significant regression for
+  long-running batch jobs.
+
+To address this issue, we introduced a feature for the recovery of batch jobs 
after a JobMaster failover in Flink 
+version 1.20. The main purpose of this feature is to enable batch jobs to 
recover as much progress as possible after a JobMaster
+failover, avoiding the need to rerun tasks that have already been finished.
+
+To implement this feature, we introduced a JobEventStore component. This 
component allows Flink to record state change
+events of the JobMaster (such as ExecutionGraph, OperatorCoordinator, etc.) to 
an external filesystem. During the crash
+and subsequent restart of the JobMaster, TaskManagers will retain the 
intermediate result
+data produced by the job and attempt to reconnect continuously. Once the 
JobMaster restarts, it will re-establish
+connections with TaskManagers and recover the job state based on the retained
+intermediate results and the events previously recorded in the JobEventStore, 
thereby resuming the job's execution
+progress.
+
+## Usage
+
+This section explains how to enable recovery of batch jobs from JobMaster 
failures and how to use
+configuration options to optimize the recovery process.
+
+### How to enable batch job Recovery from job master failures
+
+- Enable cluster high availability:
+
+  To enable the recovery of batch jobs from JobMaster failures, it is 
essential to first ensure that cluster
+  high availability (HA) is enabled. Flink supports HA services compatible 
with both ZooKeeper and Kubernetes.
+  For detailed configuration steps, please consult the section on [cluster 
high availability]({{< ref 
"docs/deployment/ha/overview#how-to-make-a-cluster-highly-available" >}})

Review Comment:
   For detailed configuration steps, please consult the section on cluster high 
availability in the Flink official documentation.
   
   -> More details of the configuration can be found in the [High Availability] 
page.



##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be 
restarted. For streaming jobs, it can resume 
+  from the last successful checkpoint. However, for batch jobs, which do not 
have checkpoints, it will have to start
+  over from the beginning, losing all previously made progress. This 
represents a significant regression for
+  long-running batch jobs.
+
+To address this issue, we introduced a feature for the recovery of batch jobs 
after a JobMaster failover in Flink 
+version 1.20. The main purpose of this feature is to enable batch jobs to 
recover as much progress as possible after a JobMaster

Review Comment:
   To address this issue, we introduced a feature for the recovery of batch 
jobs after a JobMaster failover in Flink 
   version 1.20. The main purpose of this feature is to enable ...
   
   -> To address this issue, a batch job recovery mechanism is introduced to 
enable ...



##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:

Review Comment:
   Currently -> Previously
   
   This will be the feature document of Flink 1.20.



##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be 
restarted. For streaming jobs, it can resume 
+  from the last successful checkpoint. However, for batch jobs, which do not 
have checkpoints, it will have to start
+  over from the beginning, losing all previously made progress. This 
represents a significant regression for
+  long-running batch jobs.
+
+To address this issue, we introduced a feature for the recovery of batch jobs 
after a JobMaster failover in Flink 
+version 1.20. The main purpose of this feature is to enable batch jobs to 
recover as much progress as possible after a JobMaster
+failover, avoiding the need to rerun tasks that have already been finished.
+
+To implement this feature, we introduced a JobEventStore component. This 
component allows Flink to record state change
+events of the JobMaster (such as ExecutionGraph, OperatorCoordinator, etc.) to 
an external filesystem. During the crash
+and subsequent restart of the JobMaster, TaskManagers will retain the 
intermediate result
+data produced by the job and attempt to reconnect continuously. Once the 
JobMaster restarts, it will re-establish
+connections with TaskManagers and recover the job state based on the retained
+intermediate results and the events previously recorded in the JobEventStore, 
thereby resuming the job's execution
+progress.
+
+## Usage
+
+This section explains how to enable recovery of batch jobs from JobMaster 
failures and how to use
+configuration options to optimize the recovery process.
+
+### How to enable batch job Recovery from job master failures
+
+- Enable cluster high availability:
+
+  To enable the recovery of batch jobs from JobMaster failures, it is 
essential to first ensure that cluster
+  high availability (HA) is enabled. Flink supports HA services compatible 
with both ZooKeeper and Kubernetes.

Review Comment:
   Flink supports HA services compatible with both ZooKeeper and Kubernetes
   
   -> Flink supports HA services backed by ZooKeeper or Kubernetes.
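   
   It might also help to give a concrete configuration example at this point, e.g.
   something along the lines of the sketch below (illustrative only; the HA keys are
   the standard ZooKeeper-backed HA options, and the hosts/paths are placeholders):
   
   ```yaml
   # Illustrative sketch: HA (ZooKeeper-backed here; Kubernetes HA would use
   # high-availability.type: kubernetes) plus batch job recovery.
   high-availability.type: zookeeper
   high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181  # placeholder hosts
   high-availability.storageDir: hdfs:///flink/ha/                    # placeholder path
   
   # The option documented on this page; note it only takes effect with the
   # Adaptive Batch Scheduler, which is the default scheduler for batch jobs.
   execution.batch.job-recovery.enabled: true
   ```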



##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be 
restarted. For streaming jobs, it can resume 
+  from the last successful checkpoint. However, for batch jobs, which do not 
have checkpoints, it will have to start
+  over from the beginning, losing all previously made progress. This 
represents a significant regression for
+  long-running batch jobs.
+
+To address this issue, we introduced a feature for the recovery of batch jobs 
after a JobMaster failover in Flink 
+version 1.20. The main purpose of this feature is to enable batch jobs to 
recover as much progress as possible after a JobMaster
+failover, avoiding the need to rerun tasks that have already been finished.
+
+To implement this feature, we introduced a JobEventStore component. This 
component allows Flink to record state change
+events of the JobMaster (such as ExecutionGraph, OperatorCoordinator, etc.) to 
an external filesystem. During the crash
+and subsequent restart of the JobMaster, TaskManagers will retain the 
intermediate result
+data produced by the job and attempt to reconnect continuously. Once the 
JobMaster restarts, it will re-establish
+connections with TaskManagers and recover the job state based on the retained
+intermediate results and the events previously recorded in the JobEventStore, 
thereby resuming the job's execution
+progress.
+
+## Usage
+
+This section explains how to enable recovery of batch jobs from JobMaster 
failures and how to use
+configuration options to optimize the recovery process.
+
+### How to enable batch job Recovery from job master failures
+
+- Enable cluster high availability:
+
+  To enable the recovery of batch jobs from JobMaster failures, it is 
essential to first ensure that cluster
+  high availability (HA) is enabled. Flink supports HA services compatible 
with both ZooKeeper and Kubernetes.
+  For detailed configuration steps, please consult the section on [cluster 
high availability]({{< ref 
"docs/deployment/ha/overview#how-to-make-a-cluster-highly-available" >}})
+  in the Flink official documentation.
+- Configure [execution.batch.job-recovery.enabled]({{< ref 
"docs/deployment/config" >}}#execution-batch-job-recovery-enabled): true
+
+Note that currently only [Adaptive Batch Scheduler]({{< ref 
"docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler) 
+supports this feature. And Flink batch jobs will use this scheduler by default 
unless another scheduler is explicitly configured.
+
+### Configuration Optimization
+
+To enable batch jobs to recover as much progress as possible after a JobMaster 
failover, and avoid rerunning tasks 
+that have already been finished, you can configure the following options for 
optimization:
+
+- [execution.batch.job-recovery.snapshot.min-pause]({{< ref 
"docs/deployment/config" >}}#execution-batch-job-recovery-snapshot-min-pause):
+  This setting determines the minimum pause time allowed between snapshots for 
the OperatorCoordinator and ShuffleMaster.
+  This parameter could be adjusted based on the expected I/O load of your 
cluster and the tolerable amount of state regression. 
+  Reduce this interval if smaller state regressions are preferred and a higher 
I/O load is acceptable.
+- [execution.batch.job-recovery.previous-worker.recovery.timeout]({{< ref 
"docs/deployment/config" 
>}}#execution-batch-job-recovery-previous-worker-recovery-timeout):
+  This setting determines the timeout duration allowed for Shuffle workers to 
reconnect. During the recovery process, Flink 
+  requests the retained intermediate result data information from the Shuffle 
Master. If the timeout is reached, 
+  Flink will use all the acquired intermediate result data to recover the 
state.
+- [job-event.store.write-buffer.flush-interval]({{< ref 
"docs/deployment/config" >}}#job-event-store-write-buffer-flush-interval):
+  This setting determines the flush interval for the JobEventStore's write 
buffers.
+- [job-event.store.write-buffer.size]({{< ref "docs/deployment/config" 
>}}#job-event-store-write-buffer-size): This 
+  setting determines the write buffer size in the JobEventStore. When the 
buffer is full, its contents are flushed to the external
+  filesystem.
+
+## Limitations
+
+- Only compatible with the new Source (FLIP-27): Since the legacy Source has 
been deprecated, this feature is only 
+  compatible with the new Source.
+- Exclusive to the [Adaptive Batch Scheduler]({{< ref 
"docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler): 
+  Currently, only the Adaptive Batch Scheduler supports the recovery of batch 
jobs after a
+  JobMaster failover. As a result, the feature inherits all the
+  [limitations of the Adaptive Batch Scheduler]({{<ref 
"docs/deployment/elastic_scaling" >}}#limitations-2).
+- Remote Shuffle Service not supported.

Review Comment:
   -> Not working when using remote shuffle services.
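   
   The options in the `Configuration Optimization` section above might also be easier
   to follow with a short example block showing them together, e.g. (values are purely
   illustrative, not recommended defaults; tune to the cluster's I/O capacity and the
   tolerable amount of state regression):
   
   ```yaml
   # Minimum pause between OperatorCoordinator / ShuffleMaster snapshots;
   # a smaller pause means less state regression but more I/O load.
   execution.batch.job-recovery.snapshot.min-pause: 3 min
   
   # How long to wait for previous shuffle workers to reconnect before
   # recovering with whatever intermediate result data has been obtained.
   execution.batch.job-recovery.previous-worker.recovery.timeout: 30 s
   
   # JobEventStore write buffers are flushed on this interval or when full.
   job-event.store.write-buffer.flush-interval: 1 s
   job-event.store.write-buffer.size: 1 mb
   ```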



##########
docs/content/docs/ops/batch/recovery_from_job_master_failure.md:
##########
@@ -0,0 +1,99 @@
+---
+title: "Recovery from job master failures"
+weight: 4
+type: docs
+aliases:
+
+- /docs/ops/batch/recovery_from_job_master_failure.html
+- /docs/ops/batch/recovery_from_job_master_failure
+
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Batch jobs Recovery from job master failures
+
+## Background
+
+Currently, if the JobMaster fails and is terminated, one of the following two 
situations will occur:
+
+- If high availability (HA) is disabled, the job will fail.
+- If HA is enabled, a JobMaster failover will happen and the job will be 
restarted. For streaming jobs, it can resume 
+  from the last successful checkpoint. However, for batch jobs, which do not 
have checkpoints, it will have to start
+  over from the beginning, losing all previously made progress. This 
represents a significant regression for
+  long-running batch jobs.
+
+To address this issue, we introduced a feature for the recovery of batch jobs 
after a JobMaster failover in Flink 
+version 1.20. The main purpose of this feature is to enable batch jobs to 
recover as much progress as possible after a JobMaster
+failover, avoiding the need to rerun tasks that have already been finished.
+
+To implement this feature, we introduced a JobEventStore component. This 
component allows Flink to record state change
+events of the JobMaster (such as ExecutionGraph, OperatorCoordinator, etc.) to 
an external filesystem. During the crash
+and subsequent restart of the JobMaster, TaskManagers will retain the 
intermediate result
+data produced by the job and attempt to reconnect continuously. Once the 
JobMaster restarts, it will re-establish
+connections with TaskManagers and recover the job state based on the retained
+intermediate results and the events previously recorded in the JobEventStore, 
thereby resuming the job's execution
+progress.
+
+## Usage
+
+This section explains how to enable recovery of batch jobs from JobMaster 
failures and how to use
+configuration options to optimize the recovery process.
+
+### How to enable batch job Recovery from job master failures
+
+- Enable cluster high availability:
+
+  To enable the recovery of batch jobs from JobMaster failures, it is 
essential to first ensure that cluster
+  high availability (HA) is enabled. Flink supports HA services compatible 
with both ZooKeeper and Kubernetes.
+  For detailed configuration steps, please consult the section on [cluster 
high availability]({{< ref 
"docs/deployment/ha/overview#how-to-make-a-cluster-highly-available" >}})
+  in the Flink official documentation.
+- Configure [execution.batch.job-recovery.enabled]({{< ref 
"docs/deployment/config" >}}#execution-batch-job-recovery-enabled): true
+
+Note that currently only [Adaptive Batch Scheduler]({{< ref 
"docs/deployment/elastic_scaling" >}}#adaptive-batch-scheduler) 
+supports this feature. And Flink batch jobs will use this scheduler by default 
unless another scheduler is explicitly configured.
+
+### Configuration Optimization
+
+To enable batch jobs to recover as much progress as possible after a JobMaster 
failover, and avoid rerunning tasks 
+that have already been finished, you can configure the following options for 
optimization:
+
+- [execution.batch.job-recovery.snapshot.min-pause]({{< ref 
"docs/deployment/config" >}}#execution-batch-job-recovery-snapshot-min-pause):
+  This setting determines the minimum pause time allowed between snapshots for 
the OperatorCoordinator and ShuffleMaster.
+  This parameter could be adjusted based on the expected I/O load of your 
cluster and the tolerable amount of state regression. 
+  Reduce this interval if smaller state regressions are preferred and a higher 
I/O load is acceptable.
+- [execution.batch.job-recovery.previous-worker.recovery.timeout]({{< ref 
"docs/deployment/config" 
>}}#execution-batch-job-recovery-previous-worker-recovery-timeout):
+  This setting determines the timeout duration allowed for Shuffle workers to 
reconnect. During the recovery process, Flink 
+  requests the retained intermediate result data information from the Shuffle 
Master. If the timeout is reached, 
+  Flink will use all the acquired intermediate result data to recover the 
state.
+- [job-event.store.write-buffer.flush-interval]({{< ref 
"docs/deployment/config" >}}#job-event-store-write-buffer-flush-interval):
+  This setting determines the flush interval for the JobEventStore's write 
buffers.
+- [job-event.store.write-buffer.size]({{< ref "docs/deployment/config" 
>}}#job-event-store-write-buffer-size): This 
+  setting determines the write buffer size in the JobEventStore. When the 
buffer is full, its contents are flushed to the external
+  filesystem.
+
+## Limitations
+
+- Only compatible with the new Source (FLIP-27): Since the legacy Source has 
been deprecated, this feature is only 

Review Comment:
   compatible with -> working with



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
