DanielCarter-stack commented on PR #10232:
URL: https://github.com/apache/seatunnel/pull/10232#issuecomment-3834939829

   <!-- code-pr-reviewer -->
   <!-- cpr:pr_reply_v2_parts {"group": "apache/seatunnel#10232", "part": 1, "total": 1} -->
   ### Issue 1: Missing checkpoint path pre-validation
   
   **Location**: 
`seatunnel-core/seatunnel-flink-starter/seatunnel-flink-starter-common/src/main/java/org/apache/seatunnel/core/starter/flink/AbstractFlinkStarter.java` (around line 7)
   
   **Related context**:
   - Caller: command line startup entry
   - Downstream dependency: Flink `ExecutionEnvironment.configure()`
   - Affected configuration: `FlinkCommandArgs.restoreFromCheckpoint`
   
   **Problem description**:
   The code does not perform any validation before passing the 
`restoreFromCheckpoint` parameter to Flink. Any of the following user-supplied 
values leads to an unexpected error:
   1. Empty string `""`
   2. Whitespace only `"   "`
   3. Non-existent path
   4. Malformed URI
   
   Flink will throw an exception, but the error message may not be clear 
enough, making it difficult for users to distinguish between path errors and 
other issues.
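
   For illustration, a minimal standalone sketch (class and method names hypothetical, not part of the PR) of how these inputs behave under plain `java.net.URI` parsing, and why a pre-check can name the failure precisely:

   ```java
   import java.net.URI;
   import java.net.URISyntaxException;

   public class CheckpointPathCheck {
       // Returns a human-readable reason why the path is invalid, or null if it looks usable.
       static String invalidReason(String path) {
           if (path == null || path.trim().isEmpty()) {
               return "path is empty or blank";                               // cases 1 and 2
           }
           try {
               if (new URI(path).getScheme() == null) {
                   return "path has no scheme (expected e.g. hdfs:// or s3://)";
               }
           } catch (URISyntaxException e) {
               return "path is not a valid URI: " + e.getReason();            // case 4
           }
           return null; // case 3 (non-existent path) still needs a file system check
       }

       public static void main(String[] args) {
           System.out.println(invalidReason("   "));                      // blank -> rejected
           System.out.println(invalidReason("hdfs://bad path"));          // malformed -> rejected
           System.out.println(invalidReason("hdfs://ns/checkpoints/j1")); // well-formed -> null
       }
   }
   ```

   Only the existence check (case 3) requires touching the file system; the other three can be rejected before any cluster interaction.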
   
   **Potential risks**:
   - **Risk 1**: Poor user experience, error messages are not user-friendly
   - **Risk 2**: Difficult to debug, unable to quickly identify whether it's a 
path issue or permission issue
   - **Risk 3**: On resource-constrained clusters, may waste one job submission 
attempt
   
   **Scope of impact**:
   - **Direct impact**: `AbstractFlinkStarter`, `FlinkCommandArgs`
   - **Indirect impact**: All users using the `--restore-from-checkpoint` 
parameter
   - **Impact area**: Single module
   
   **Severity**: **MAJOR**
   
   **Improvement suggestions**:
   ```java
   // Add validation logic in AbstractFlinkStarter.java
   // Requires: import java.net.URI;
   //           import org.apache.flink.core.fs.FileSystem;
   //           import org.apache.flink.core.fs.Path;
   private void validateCheckpointPath(String checkpointPath) {
       if (checkpointPath == null || checkpointPath.trim().isEmpty()) {
           throw new IllegalArgumentException("Checkpoint path must not be empty or blank");
       }
       // Check path format
       if (!checkpointPath.contains("://")) {
           throw new IllegalArgumentException(
               "Checkpoint path must be a valid URI (e.g., hdfs://path or s3://path), "
                   + "but got: " + checkpointPath);
       }

       // Optional: check that the path exists (requires file system access).
       // Note: this adds submission-time I/O overhead and is a trade-off.
       try {
           FileSystem fs = FileSystem.get(new URI(checkpointPath));
           if (!fs.exists(new Path(checkpointPath))) {
               throw new IllegalArgumentException(
                   "Checkpoint path does not exist: " + checkpointPath);
           }
       } catch (IllegalArgumentException e) {
           throw e; // preserve the specific "does not exist" message
       } catch (Exception e) {
           throw new IllegalArgumentException(
               "Failed to access checkpoint path: " + checkpointPath, e);
       }
   }

   // Call before setting the checkpoint
   if (args.getRestoreFromCheckpoint().isPresent()) {
       String checkpointPath = args.getRestoreFromCheckpoint().get();
       validateCheckpointPath(checkpointPath);
       // ... continue setting the Flink configuration
   }
   ```
   
   **Rationale**:
   - Early validation can provide clearer error messages
   - Reduce unnecessary job submission attempts
   - For non-existent paths, fail early instead of waiting until the job starts 
to report an error
   
   ---
   
   ### Issue 2: Missing key integration tests
   
   **Location**: 
`seatunnel-core/seatunnel-flink-starter/seatunnel-flink-starter-common/src/test/java/org/apache/seatunnel/core/starter/flink/`
   
   **Related context**:
   - Test class: `FlinkCommandArgsTest.java`
   - Class under test: `AbstractFlinkStarter.java`, `FlinkCommandArgs.java`
   - E2E test directory: `seatunnel-e2e/`
   
   **Problem description**:
   The current PR only adds unit tests for command line argument parsing, 
missing the following key test scenarios:
   
   1. **End-to-end recovery test**: Does not verify the complete process of 
recovering from checkpoint
   2. **Exception scenario tests**:
      - Checkpoint path does not exist
      - Checkpoint file corrupted
      - Incompatibility caused by schema changes
   3. **Exactly-once semantic verification**: Does not verify that data is 
neither duplicated nor lost after recovery
   4. **Different Flink version compatibility**: Does not test cross-version 
checkpoint recovery
   
   **Potential risks**:
   - **Risk 1**: Core functionality not fully verified, may fail in production 
environment
   - **Risk 2**: Exactly-once semantics broken, leading to data duplication or 
loss
   - **Risk 3**: After upgrading Flink version, old checkpoints may not be 
recoverable
   - **Risk 4**: Improper exception handling, causing jobs unable to recover
   
   **Scope of impact**:
   - **Direct impact**: All users using checkpoint recovery functionality
   - **Indirect impact**: Data consistency and reliability
   - **Impact area**: **Core framework functionality**
   
   **Severity**: **CRITICAL**
   
   **Improvement suggestions**:
   
   1. **Add integration tests** (reference structure of 
`MultiTableSinkTest.java`):
   
   ```java
   // Add new test class: CheckpointRestoreIT.java
   @Test
   public void testRestoreFromCheckpoint() throws Exception {
       // 1. Start the first job and write data
       String jobId1 = submitJob(configFile);
       waitForDataWritten(100);
       
       // 2. Trigger checkpoint
       String checkpointPath = triggerCheckpoint(jobId1);
       
       // 3. Continue writing more data
       waitForDataWritten(200);
       
       // 4. Stop the job
       cancelJob(jobId1);
       
       // 5. Restore from checkpoint
       String jobId2 = submitJobWithRestore(configFile, checkpointPath);
       
       // 6. Verify data consistency (total should be 200, with no duplication or loss)
       assertEquals(200, getRecordCount());
       
       // 7. Verify restored checkpoint state
       assertJobRestoredFromCheckpoint(jobId2, checkpointPath);
   }
   ```
   
   2. **Add exception scenario tests**:
   
   ```java
   @Test(expected = IllegalArgumentException.class)
   public void testInvalidCheckpointPath() {
       submitJobWithRestore(configFile, "invalid://non-existent-path");
   }
   
   @Test(expected = IllegalArgumentException.class)
   public void testEmptyCheckpointPath() {
       submitJobWithRestore(configFile, "");
   }
   ```
   
   3. **Add E2E tests** (in `seatunnel-e2e` module):
   
   ```bash
   # Add test script: seatunnel-e2e/seatunnel-flink-e2e/src/test/resources/checkpoint_restore.conf
   # Configure source and sink to verify exactly-once semantics
   ```
   
   **Rationale**:
   - Checkpoint recovery is a **critical fault tolerance feature** in 
production, insufficient testing may lead to data loss
   - Exactly-once semantics is a core guarantee of stream processing and must 
be verified through E2E tests
   - Exception scenarios will inevitably occur in actual operations and must be 
tested in advance
   
   ---
   
   ### Issue 3: Documentation missing key information
   
   **Location**:
   - `docs/en/engines/flink.md`
   - `docs/zh/engines/flink.md`
   - `docs/en/engines/command/usage.mdx`
   
   **Related context**:
   - User documentation
   - Operations manual
   - Troubleshooting guide
   
   **Problem description**:
   The documentation is missing the following key information, making it 
difficult for users and operators to correctly use this feature:
   
   1. **Prerequisites not explained**:
      - Need to enable checkpoint in Flink configuration in advance
      - Need to configure checkpoint storage backend (HDFS, S3, etc.)
      - Checkpoint path must be accessible to the cluster
   
   2. **Usage guide incomplete**:
      - Does not explain how to find saved checkpoint paths
      - Does not explain methods to verify if checkpoint is valid
      - Does not explain indicators of successful recovery
   
   3. **Troubleshooting missing**:
      - Common causes of recovery failure (path does not exist, insufficient 
permissions, schema mismatch)
      - How to diagnose if checkpoint is corrupted
      - How to handle recovery failure caused by schema changes
   
   4. **Version compatibility not explained**:
      - Checkpoint compatibility between different Flink versions
      - Precautions when upgrading SeaTunnel version
   
   5. **Parameter combination usage not explained**:
      - Usage scenarios of `--allow-non-restored-state` parameter
      - How to handle added or removed operators
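
   Item 5 could be covered with a short doc fragment like the sketch below (wording illustrative; the described behavior matches Flink's standard `allowNonRestoredState` semantics):

   ```markdown
   ### Recovering after a job graph change

   If operators were added or removed since the checkpoint was taken, restore with:

       --restore-from-checkpoint <path> --allow-non-restored-state

   `--allow-non-restored-state` tells Flink to skip checkpoint state that no longer
   maps to any operator in the new job graph. Use it deliberately: the skipped
   state is dropped from the restored job.
   ```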
   
   **Potential risks**:
   - **Risk 1**: Users cannot correctly use the feature, leading to recovery 
failure
   - **Risk 2**: Operators cannot quickly locate and resolve issues
   - **Risk 3**: Prolonged recovery time during production environment failures
   - **Risk 4**: Data loss due to misuse
   
   **Scope of impact**:
   - **Direct impact**: All users using this feature
   - **Indirect impact**: Operational efficiency, system reliability
   - **Impact area**: **User documentation and maintainability**
   
   **Severity**: **MAJOR**
   
   **Improvement suggestions**:
   
   Add complete sections in `docs/en/engines/flink.md`:
   
   ````markdown
   # Checkpoint Recovery

   ## Prerequisites

   Before using checkpoint recovery, make sure that:

   1. **Checkpointing is enabled**: enable checkpointing in the Flink configuration file (`flink-conf.yaml`)
      ```yaml
      state.backend: filesystem
      state.checkpoints.dir: hdfs://flink/checkpoints
      state.savepoints.dir: hdfs://flink/savepoints
      ```

   2. **The storage backend is configured**: make sure HDFS, S3, or another storage system is accessible

   3. **The job has run at least once**: a previous run must have produced a checkpoint

   ## Usage

   ### 1. Submit the job and generate a checkpoint
   ````
   
   ---
   
   ### Issue 4: Missing verification logs for successful recovery
   
   **Location**: 
`seatunnel-core/seatunnel-flink-starter/seatunnel-flink-starter-common/src/main/java/org/apache/seatunnel/core/starter/flink/AbstractFlinkStarter.java`
   
   **Related context**:
   - Logging system: SLF4J + Logback
   - Monitoring system: Flink Metrics
   
   **Problem description**:
   The code is missing key log outputs, making it impossible for operators to 
confirm:
   1. Whether the user specified checkpoint recovery parameters
   2. What the checkpoint path is
   3. Whether recovery was successful
   4. If failed, what the reason for failure is
   
   The current implementation relies on Flink's own logs, but those messages may 
not be explicit enough and can be buried in a large volume of other output.
   
   **Potential risks**:
   - **Risk 1**: Unable to quickly confirm whether recovery was successful
   - **Risk 2**: Difficult to troubleshoot problems
   - **Risk 3**: Incomplete audit trail
   
   **Scope of impact**:
   - **Direct impact**: Operations and troubleshooting
   - **Indirect impact**: System observability
   - **Impact area**: Single module
   
   **Severity**: **MINOR**
   
   **Improvement suggestions**:
   
   ```java
   // Add logging in AbstractFlinkStarter.java
   // Requires: org.slf4j.Logger, org.slf4j.LoggerFactory,
   //           org.apache.flink.configuration.Configuration,
   //           org.apache.flink.runtime.jobgraph.SavepointConfigOptions
   private static final Logger LOG = LoggerFactory.getLogger(AbstractFlinkStarter.class);

   public void execute(FlinkCommandArgs args) throws Exception {
       // ... existing code ...

       // Add checkpoint recovery logging
       if (args.getRestoreFromCheckpoint().isPresent()) {
           String checkpointPath = args.getRestoreFromCheckpoint().get();
           LOG.info("Restoring Flink job from checkpoint: {}", checkpointPath);

           try {
               validateCheckpointPath(checkpointPath);

               // Set Flink configuration
               Configuration flinkConfig = new Configuration();
               flinkConfig.set(SavepointConfigOptions.SAVEPOINT_PATH, checkpointPath);

               LOG.info("Checkpoint restore configuration applied successfully");
           } catch (Exception e) {
               LOG.error("Failed to configure checkpoint restore from path: {}", checkpointPath, e);
               throw e;
           }
       } else {
           LOG.debug("No checkpoint restore path specified, starting a new job");
       }

       // ... continue job submission ...
   }
   ```
   
   **Rationale**:
   - Improve observability
   - Facilitate troubleshooting
   - Comply with logging standards for production-grade code
   
   ---
   
   ### Issue 5: Missing compatibility validation for two-phase commit connectors
   
   **Location**: `AbstractFlinkStarter.java` (overall recovery process)
   
   **Related context**:
   - SeaTunnel two-phase commit mechanism
   - Connectors supporting two-phase commit (e.g., Kafka sink)
   - Interaction between Flink Checkpoint and two-phase commit
   
   **Problem description**:
   Some SeaTunnel connectors (e.g., Kafka sink) use the two-phase commit 
protocol to guarantee exactly-once semantics. When recovering from checkpoint, 
must ensure:
   
   1. Unfinished transactions are correctly rolled back
   2. Transaction ID continuity after recovery
   3. Atomicity between checkpoints and transaction state
   
   The current PR does not verify the compatibility of the recovery mechanism 
with two-phase commit, which may lead to:
   - Data duplication after recovery
   - Data loss after recovery
   - Inconsistent transaction state
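
   The duplication/loss outcomes above can be made concrete with a small standalone check of the kind an E2E assertion would run against sink output (class and method names hypothetical):

   ```java
   import java.util.*;

   public class ExactlyOnceCheck {
       // Returns keys that appear more than once in the sink output.
       static Set<String> duplicates(List<String> recordKeys) {
           Set<String> seen = new HashSet<>();
           Set<String> dups = new LinkedHashSet<>();
           for (String key : recordKeys) {
               if (!seen.add(key)) {
                   dups.add(key); // second or later occurrence -> duplicate
               }
           }
           return dups;
       }

       // Exactly-once holds iff every expected key appears exactly once.
       static boolean exactlyOnce(List<String> recordKeys, int expectedCount) {
           return recordKeys.size() == expectedCount && duplicates(recordKeys).isEmpty();
       }

       public static void main(String[] args) {
           // Replay without transaction rollback: "k1" was written twice after restore.
           List<String> withReplay = Arrays.asList("k0", "k1", "k1", "k2");
           System.out.println(duplicates(withReplay));      // [k1]
           System.out.println(exactlyOnce(withReplay, 3));  // false
       }
   }
   ```

   A real test would feed `recordKeys` from the sink (e.g., Kafka topic contents) rather than an in-memory list.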
   
   **Potential risks**:
   - **Risk 1**: Exactly-once semantics broken, data duplication or loss
   - **Risk 2**: Production environment data consistency issues
   - **Risk 3**: Difficult to diagnose sporadic errors
   
   **Scope of impact**:
   - **Direct impact**: Users using two-phase commit connectors (Kafka, Pulsar, 
etc.)
   - **Indirect impact**: Data consistency
   - **Impact area**: **Multiple connectors and core framework**
   
   **Severity**: **CRITICAL**
   
   **Improvement suggestions**:
   
   1. **Add compatibility tests**:
   
   ```java
   // Add in E2E test
   @Test
   public void testTwoPhaseCommitWithCheckpointRestore() throws Exception {
       // Use Kafka sink (supports two-phase commit)
       String configFile = "kafka_sink_with_2pc.conf";
       
       // 1. Submit job and write data
       String jobId1 = submitJob(configFile);
       waitForRecordsInKafka(100);
       
       // 2. Trigger checkpoint (there may be uncommitted transactions at this point)
       String checkpointPath = triggerCheckpoint(jobId1);
       
       // 3. Continue writing
       waitForRecordsInKafka(200);
       
       // 4. Stop the job
       cancelJob(jobId1);
       
       // 5. Restore from checkpoint
       String jobId2 = submitJobWithRestore(configFile, checkpointPath);
       waitForRecordsInKafka(300);
       
       // 6. Verify exactly-once semantics
       long recordCount = countKafkaRecords();
       assertEquals(300, recordCount,
           "Data should be exactly-once after restore, but got: " + recordCount);
       
       // 7. Verify no duplicate data
       assertNoDuplicateRecords();
   }
   ```
   
   2. **Explain precautions in documentation**:
   
   ```markdown
   ## Checkpoint Recovery and Two-Phase Commit

   If your sink connector uses two-phase commit (e.g., Kafka, Pulsar), note that:

   1. **Transaction cleanup**: on recovery, Flink automatically cleans up unfinished transactions
   2. **State consistency**: make sure the connector correctly implements the `CheckpointListener` interface
   3. **Verification steps**:
      - after recovery, check the sink for duplicate data
      - inspect the transaction log to confirm that uncommitted transactions were rolled back
   ```
   
   **Rationale**:
   - Exactly-once is a core guarantee of stream processing
   - Two-phase commit is a key mechanism for implementing exactly-once
   - Must verify that checkpoint recovery does not break transaction state
   
   ---
   
   ### Issue 6: Command line argument conflicts not handled
   
   **Location**: `FlinkCommandArgs.java`
   
   **Related context**:
   - Command line arguments: `--restore-from-checkpoint`
   - Potentially conflicting argument: `--from-savepoint` (a Flink-native 
option, if one exists)
   
   **Problem description**:
   If a user specifies several recovery-related parameters at once, or combines 
checkpoint recovery with an incompatible parameter, the code neither detects 
nor reports the conflict.
   
   For example:
   - User specifies both `--restore-from-checkpoint` and Flink native 
`--from-savepoint` (if supported in the future)
   - User specifies checkpoint recovery but specifies incompatible running mode
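
   A standalone sketch of the mutual-exclusion check this would need (all flag names except `--restore-from-checkpoint` are hypothetical):

   ```java
   import java.util.*;

   public class ArgConflictCheck {
       // Flag pairs that must not be combined; extend as new options are added.
       static final String[][] EXCLUSIVE = {
           {"--restore-from-checkpoint", "--from-savepoint"},
       };

       // Returns an error message for the first conflicting pair, or null if none conflict.
       static String findConflict(Set<String> flags) {
           for (String[] pair : EXCLUSIVE) {
               if (flags.contains(pair[0]) && flags.contains(pair[1])) {
                   return "Cannot specify both " + pair[0] + " and " + pair[1];
               }
           }
           return null;
       }

       public static void main(String[] args) {
           Set<String> both = new HashSet<>(
               Arrays.asList("--restore-from-checkpoint", "--from-savepoint"));
           System.out.println(findConflict(both));
           System.out.println(findConflict(
               Collections.singleton("--restore-from-checkpoint"))); // null
       }
   }
   ```

   Running the check once, right after parsing, keeps the error close to the user's input instead of surfacing later as unexpected runtime behavior.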
   
   **Potential risks**:
   - **Risk 1**: Argument conflicts lead to unexpected behavior
   - **Risk 2**: User confusion about which parameter takes effect
   
   **Scope of impact**:
   - **Direct impact**: Command line argument parsing
   - **Impact area**: Single module
   
   **Severity**: **MINOR**
   
   **Improvement suggestions**:
   
   ```java
   // Add parameter conflict detection in FlinkCommandArgs.java
   // (otherConflictingParam is a placeholder for any option that is
   //  mutually exclusive with --restore-from-checkpoint)
   public void validate() {
       if (restoreFromCheckpoint.isPresent()) {
           // Check for conflicts with other parameters
           if (otherConflictingParam.isPresent()) {
               throw new IllegalArgumentException(
                   "Cannot specify both --restore-from-checkpoint and --other-param");
           }
       }
   }
   ```
   
   **Rationale**:
   - Improve user experience
   - Avoid unexpected behavior caused by argument conflicts
   
   ---


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
