DanielCarter-stack commented on PR #10562:
URL: https://github.com/apache/seatunnel/pull/10562#issuecomment-4003611773

   <!-- code-pr-reviewer -->
   <!-- cpr:pr_reply_v2_parts {"group": "apache/seatunnel#10562", "part": 1, "total": 1} -->
   ### Issue 1: Missing unit test coverage
   
   **Location**:
   - `seatunnel-engine/seatunnel-engine-common/src/test/java/org/apache/seatunnel/engine/common/utils/ExceptionUtilTest.java`
   - `seatunnel-engine/seatunnel-engine-server/src/test/java/org/apache/seatunnel/engine/server/CoordinatorServiceTest.java`
   
   **Related Context**:
   - Modified class: `ExceptionUtil.java:155-161`
   - Modified class: `CoordinatorService.java:508-526`
   - Callers: `JobMaster.java:490-494, 686-690, 732-733` and many other places
   
   **Problem Description**:
   The PR modifies core exception-classification logic and the job recovery process but does not add any unit tests. Specifically missing:
   
   1. `ExceptionUtil.isOperationNeedRetryException` has no test cases verifying that it correctly identifies `RetryableHazelcastException`
   2. `CoordinatorService.restoreJobFromMasterActiveSwitch` has no tests simulating IMap loading scenarios
   3. No tests verify that the retry mechanism is triggered correctly
   4. No tests cover the behavior after the retry count is exhausted
   
   **Potential Risks**:
   - **Risk 1**: Future modifications may accidentally break this logic, 
causing regressions
   - **Risk 2**: Cannot verify whether the fix is truly effective
   - **Risk 3**: Edge conditions (e.g., still failing after 30 retries) are not 
verified
   
   **Impact Scope**:
   - **Direct Impact**: `ExceptionUtil` and `CoordinatorService` classes
   - **Indirect Impact**: All scenarios that depend on master switch job 
recovery functionality
   - **Impact Area**: Core framework
   
   **Severity**: MAJOR
   
   **Improvement Suggestions**:
   
    ```java
    // Added in ExceptionUtilTest.java
    @Test
    void testIsOperationNeedRetryException_withRetryableHazelcastException() {
        RetryableHazelcastException exception = new RetryableHazelcastException("Test");
        assertTrue(ExceptionUtil.isOperationNeedRetryException(exception));
    }

    @Test
    void testIsOperationNeedRetryException_withWrappedRetryableHazelcastException() {
        Throwable exception =
                new Exception(new RuntimeException(new RetryableHazelcastException("Test")));
        assertTrue(ExceptionUtil.isOperationNeedRetryException(exception));
    }

    @Test
    void testIsOperationNeedRetryException_withNonRetryableException() {
        Exception exception = new Exception("Non-retryable");
        assertFalse(ExceptionUtil.isOperationNeedRetryException(exception));
    }
    ```
   
    ```java
    // Added in CoordinatorServiceTest.java
    // (need to mock the Hazelcast IMap to throw RetryableHazelcastException)
    @Test
    public void testRestoreJobFromMasterActiveSwitch_withIMapLoadingRetry() {
        // Use Mockito to mock runningJobStateIMap
        // First 3 calls throw RetryableHazelcastException
        // 4th call returns a normal jobState
        // Verify the retry logic takes effect and the job is successfully recovered
    }

    @Test
    public void testRestoreJobFromMasterActiveSwitch_retryExhausted() {
        // Mock to always throw RetryableHazelcastException
        // Verify SeaTunnelEngineException is thrown after 30 retries
    }
    ```
   
   **Rationale**: Testing is critical to ensuring code quality and preventing regressions. Especially for bugs in distributed scenarios, manual testing is costly and cannot cover all cases.
   
   ---
   
   ### Issue 2: Missing JavaDoc comments
   
   **Location**: `seatunnel-engine/seatunnel-engine-common/src/main/java/org/apache/seatunnel/engine/common/utils/ExceptionUtil.java:155`
   
   **Related Context**:
   - Method: `ExceptionUtil.isOperationNeedRetryException`
   - Callers: Retry logic in 9 different classes
   
   **Problem Description**:
   The `isOperationNeedRetryException` method lacks JavaDoc comments to explain:
   1. What is the purpose of this method
   2. Which exception types are considered to need retry
   3. Why these exceptions need retry
   4. How callers should use this method
   
   **Potential Risks**:
   - **Risk 1**: New developers may not understand the importance of this 
method and modify it incorrectly
   - **Risk 2**: May miss adding new retryable exception types
   - **Risk 3**: Difficult to maintain, unclear about specific scenarios for 
each retryable exception
   
   **Impact Scope**:
   - **Direct Impact**: Maintainability of `ExceptionUtil` class
   - **Indirect Impact**: All developers who use this method
   - **Impact Area**: Core framework
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
    ```java
    /**
     * Check if an exception indicates an operation that should be retried.
     *
     * <p>This method is used by {@link RetryUtils} to determine whether an operation
     * should be retried based on the exception thrown. It extracts the root cause of
     * the exception chain and checks whether it matches a known retryable exception type.
     *
     * <p>The following exception types are considered retryable:
     * <ul>
     *   <li>{@link HazelcastInstanceNotActiveException} - the Hazelcast instance is not active</li>
     *   <li>{@link InterruptedException} - the operation was interrupted</li>
     *   <li>{@link OperationTimeoutException} - the operation timed out</li>
     *   <li>{@link RetryableHazelcastException} - Hazelcast explicitly marks this as retryable,
     *       e.g., when an IMap is still loading from external storage (S3) in EAGER mode</li>
     * </ul>
     *
     * <p><b>Important:</b> When adding new retryable exception types, ensure that:
     * <ol>
     *   <li>The exception is truly transient and retrying may succeed</li>
     *   <li>Retrying does not cause duplicate operations or data inconsistency</li>
     *   <li>All call sites can handle the retry delay appropriately</li>
     * </ol>
     *
     * @param e the exception to check (may be wrapped in other exceptions)
     * @return {@code true} if the exception (or its root cause) is retryable; {@code false} otherwise
     * @see RetryUtils#retryWithException
     * @see org.apache.seatunnel.common.utils.ExceptionUtils#getRootException
     */
    public static boolean isOperationNeedRetryException(@NonNull Throwable e) {
        Throwable exception = ExceptionUtils.getRootException(e);
        return exception instanceof HazelcastInstanceNotActiveException
                || exception instanceof InterruptedException
                || exception instanceof OperationTimeoutException
                || exception instanceof RetryableHazelcastException;
    }
    ```
   
   **Rationale**: Comprehensive JavaDoc is crucial for public API methods, 
especially for core utility methods called from multiple places.
   
   ---
   
   ### Issue 3: PR description mentions "putIfAbsent" but it's not reflected in the code
   
   **Location**: PR description
   
   **Related Context**:
   - PR description: "Add RetryableHazelcastException to 
ExceptionUtil.isOperationNeedRetryException so SeaTunnel's RetryUtils retries 
putIfAbsent calls..."
   - Actual code: Only `runningJobStateIMap.get()` call is modified
   
   **Problem Description**:
   The PR description mentions that the fix will affect "putIfAbsent calls", 
but no `putIfAbsent` usage is seen in the actual code changes. This may mean:
   
   1. The PR description is inaccurate or outdated
   2. The "putIfAbsent" issue exists elsewhere (not included in this PR)
   3. The description confuses different fix points
   
   **Potential Risks**:
   - **Risk 1**: Code reviewers may be confused about the actual fix scope
   - **Risk 2**: If `putIfAbsent` has the same issue but is not fixed, the bug 
still exists
   - **Risk 3**: Future maintainers may misunderstand the actual impact of the 
PR
   
   **Impact Scope**:
   - **Direct Impact**: Understandability of the PR
   - **Indirect Impact**: May lead to incomplete fixes
   - **Impact Area**: Documentation and communication
   
   **Severity**: MINOR
   
   **Improvement Suggestions**:
   
   1. **Clarify PR description**: 
      - If the `putIfAbsent` issue indeed exists but is not within the scope of 
this PR, it should be explicitly stated in the PR description
      - If the description is outdated, it should be updated to accurately 
reflect the actual modifications
      
   2. **Code search verification**:
      ```bash
      # Search all usages of putIfAbsent in the project
      grep -rn "putIfAbsent" seatunnel-engine/
      ```
      
   3. **Possible description update**:
      ```
      - Add RetryableHazelcastException to ExceptionUtil.isOperationNeedRetryException
        so that all retry operations (including IMap access like get/put) can properly
        handle RetryableHazelcastException
      - Wrap runningJobStateIMap.get() in RetryUtils in restoreJobFromMasterActiveSwitch
        to handle RetryableHazelcastException during master switch
      ```
   
   **Rationale**: An accurate PR description is important for code review and future maintenance. If `putIfAbsent` has the same issue, it should either be fixed together with this PR or tracked in a new issue.
   
   ---
   
   ### Issue 4: Error handling could be more refined

   **Location**: `seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/CoordinatorService.java:519-522`

   **Related Context**:
   - Method: `restoreJobFromMasterActiveSwitch`
   - Caller: `restoreAllRunningJobFromMasterNodeSwitch` (line 483-485)

   **Problem Description**:
   The current error handling is rather coarse:
    ```java
    catch (Exception e) {
        throw new SeaTunnelEngineException(
            String.format("Job id %s restore failed, can not get job state", jobId), e);
    }
    ```
   
   This handling has several problems:

   1. **No distinction between exception types**: the same error is thrown whether `RetryableHazelcastException` retries were exhausted or some other serious exception occurred
   2. **Retry information is lost**: the actual number of retries is not recorded
   3. **No fallback strategy**: some recoverable errors could be handled in better ways
   
   **Potential Risks**:
   - **Risk 1**: Problems are hard to diagnose; retry details do not appear in the logs
   - **Risk 2**: Different failure causes cannot be handled with different strategies
   - **Risk 3**: Operators cannot quickly locate the root cause

   **Impact Scope**:
   - **Direct Impact**: Error handling in `restoreJobFromMasterActiveSwitch`
   - **Indirect Impact**: Diagnosability of job recovery failures
   - **Impact Area**: Single module

   **Severity**: MINOR

   **Improvement Suggestions**:
    ```java
    Object jobState;
    try {
        jobState = RetryUtils.retryWithException(
            () -> runningJobStateIMap.get(jobId),
            new RetryUtils.RetryMaterial(
                Constant.OPERATION_RETRY_TIME,
                true,
                ExceptionUtil::isOperationNeedRetryException,
                Constant.OPERATION_RETRY_SLEEP));
        // Add a success log (if retries occurred)
        logger.info(String.format("Successfully retrieved job state for job %s", jobId));
    } catch (Exception e) {
        // Check whether it was caused by RetryableHazelcastException
        Throwable rootCause = ExceptionUtils.getRootException(e);
        if (rootCause instanceof RetryableHazelcastException) {
            logger.severe(String.format(
                "Job %s restore failed after %d retries due to IMap still loading from S3. "
                    + "This may indicate a configuration issue or S3 connectivity problem. "
                    + "Consider increasing IMap load timeout or checking S3 connectivity.",
                jobId, Constant.OPERATION_RETRY_TIME));
        } else {
            logger.severe(String.format("Job %s restore failed due to unexpected error", jobId), e);
        }
        throw new SeaTunnelEngineException(
            String.format("Job id %s restore failed, can not get job state", jobId), e);
    }
    ```
   
   Alternatively, if finer-grained control is needed:
    ```java
    // Customize RetryUtils to support retry callbacks
    Object jobState;
    int[] retryCount = {0};
    try {
        jobState = RetryUtils.retryWithException(
            () -> {
                retryCount[0]++;
                return runningJobStateIMap.get(jobId);
            },
            new RetryUtils.RetryMaterial(
                Constant.OPERATION_RETRY_TIME,
                true,
                ExceptionUtil::isOperationNeedRetryException,
                Constant.OPERATION_RETRY_SLEEP));

        if (retryCount[0] > 1) {
            logger.info(String.format(
                "Job state retrieved after %d retries for job %s (IMap was loading)",
                retryCount[0], jobId));
        }
    } catch (Exception e) {
        // ... error handling
    }
    ```
   
   **Rationale**: Better logging and error handling can significantly improve maintainability in production environments.
   
   ---
   
   ### Issue 5: No consideration for permanent IMap unavailability

   **Location**: `seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/CoordinatorService.java:508-526`

   **Related Context**:
   - Method: `restoreJobFromMasterActiveSwitch`
   - Dependencies: `runningJobStateIMap`, `runningJobInfoIMap`

   **Problem Description**:
   The current implementation assumes the IMap will eventually finish loading, but if:
   1. The S3 connection is down
   2. The S3 storage is corrupted
   3. The IMap is misconfigured

   then even after 30 retries (60 seconds) the IMap may still be unavailable. In that case:
   - Job recovery throws an exception
   - The entry in `runningJobInfoIMap` is retained (if the failure happens before that step)
   - The job is left "dangling": it is neither in `runningJobMasterMap` nor in `pendingJobQueue`
   
   **Potential Risks**:
   - **Risk 1**: The job gets stuck in an unrecoverable state
   - **Risk 2**: `runningJobInfoIMap` and `runningJobStateIMap` must be cleaned up manually
   - **Risk 3**: There is no alerting mechanism to notify operators

   **Impact Scope**:
   - **Direct Impact**: Reliability of job recovery
   - **Indirect Impact**: The cluster's self-healing capability
   - **Impact Area**: Core framework

   **Severity**: MAJOR

   **Improvement Suggestions**:

   Consider adding a "marking mechanism" that flags unrecoverable jobs as requiring manual intervention:
    ```java
    private void restoreJobFromMasterActiveSwitch(@NonNull Long jobId, @NonNull JobInfo jobInfo) {
        Object jobState;
        try {
            jobState = RetryUtils.retryWithException(
                () -> runningJobStateIMap.get(jobId),
                new RetryUtils.RetryMaterial(
                    Constant.OPERATION_RETRY_TIME,
                    true,
                    ExceptionUtil::isOperationNeedRetryException,
                    Constant.OPERATION_RETRY_SLEEP));
        } catch (Exception e) {
            Throwable rootCause = ExceptionUtils.getRootException(e);

            // Check whether it is a RetryableHazelcastException (IMap loading issue)
            if (rootCause instanceof RetryableHazelcastException) {
                // Log to a dedicated failure queue for later processing or alerting
                logger.severe(String.format(
                    "Job %s restore failed due to IMap loading timeout after %d retries. "
                        + "Marking job as failed. JobInfo: %s",
                    jobId, Constant.OPERATION_RETRY_TIME, jobInfo));

                // Try to update the job status to FAILED (if there is a finishedJobStateIMap)
                try {
                    // Update finishedJobStateIMap, mark the job as FAILED
                    // so the job remains visible in the UI instead of disappearing entirely
                } catch (Exception ex) {
                    logger.warning("Failed to mark job as failed", ex);
                }
            }

            throw new SeaTunnelEngineException(
                String.format("Job id %s restore failed, can not get job state", jobId), e);
        }

        if (jobState == null) {
            runningJobInfoIMap.remove(jobId);
            return;
        }

        // ... normal recovery logic
    }
    ```
   
   Alternatively, a configuration option could control the behavior when job recovery fails:
   ```java
   // Add in EngineConfig
   public enum JobRestoreFailureStrategy {
       FAIL_FAST,           // Fail immediately (current behavior)
       MARK_AS_FAILED,      // Mark as FAILED, keep in history
       RETRY_INDEFINITELY   // Retry indefinitely (not recommended)
   }
   ```
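
   If such an option were added, the restore path could branch on the strategy. Below is a minimal, self-contained sketch of that dispatch; the class name, the handler strings, and the enum placement are all illustrative, not SeaTunnel's actual API:

    ```java
    // Illustrative sketch: branching a restore-failure path on a strategy enum.
    // The enum mirrors the suggestion above; handlers are placeholders.
    public class RestoreFailureDemo {
        enum JobRestoreFailureStrategy { FAIL_FAST, MARK_AS_FAILED, RETRY_INDEFINITELY }

        static String handle(JobRestoreFailureStrategy strategy) {
            switch (strategy) {
                case FAIL_FAST:
                    return "throw SeaTunnelEngineException immediately";
                case MARK_AS_FAILED:
                    return "record job as FAILED in history, continue restoring others";
                case RETRY_INDEFINITELY:
                    return "keep retrying (not recommended)";
                default:
                    throw new IllegalStateException("unknown strategy");
            }
        }

        public static void main(String[] args) {
            System.out.println(handle(JobRestoreFailureStrategy.FAIL_FAST));
        }
    }
    ```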
   
   **Rationale**: Although this is an edge case, all failure scenarios need to be considered in production environments to ensure the system does not enter an inconsistent state.
   
   ---
   
   ### Issue 6: Potential performance issue - undifferentiated 2-second retry interval

   **Location**: `seatunnel-engine/seatunnel-engine-common/src/main/java/org/apache/seatunnel/engine/common/Constant.java:42` and `CoordinatorService.java:518`

   **Related Context**:
   - Constant definition: `Constant.OPERATION_RETRY_SLEEP = 2000` (2 seconds)
   - Usage: `CoordinatorService`, `JobMaster`, and several other places

   **Problem Description**:
   The current retry strategy uses a fixed 2-second interval, which may not be optimal in some scenarios:

   1. **IMap loads quickly**: if the IMap finishes loading within 500 ms, up to 1.5 seconds per attempt are wasted
   2. **IMap loads slowly**: if the IMap takes far longer than expected to load, 30 retries at a fixed interval may not be enough
   3. **No exponential backoff**: `sleepTimeIncrease` is not enabled in the `RetryMaterial` constructor, so the interval never grows
   
   **Potential Risks**:
   - **Risk 1**: Unnecessary delay when the IMap finishes loading quickly
   - **Risk 2**: The retry count may be insufficient when the IMap loads slowly
   - **Risk 3**: A fixed interval can amplify cluster load (if many jobs retry at the same time)

   **Impact Scope**:
   - **Direct Impact**: Speed and success rate of job recovery
   - **Indirect Impact**: Overall recovery time of a master switch
   - **Impact Area**: Core framework

   **Severity**: MINOR

   **Improvement Suggestions**:

   Consider enabling exponential backoff:
   ```java
   jobState = RetryUtils.retryWithException(
       () -> runningJobStateIMap.get(jobId),
       new RetryUtils.RetryMaterial(
           Constant.OPERATION_RETRY_TIME,    // 30
           true,
           ExceptionUtil::isOperationNeedRetryException,
           Constant.OPERATION_RETRY_SLEEP,   // 2000ms as base interval
           true));  // Enable sleepTimeIncrease (exponential backoff)
   ```
   
   The retry intervals would then be:
   - 1st retry: 2 s
   - 2nd retry: 4 s
   - 3rd retry: 8 s
   - 4th retry: 16 s
   - ...
   - 30th retry: capped at 20 s (limited by `MAX_RETRY_TIME_MS`)

   The total wait time becomes more reasonable, and little time is wasted when the IMap finishes loading quickly.
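
   As a sanity check on that schedule, here is a minimal, standalone sketch of a doubling backoff capped at a maximum. Everything in it is illustrative: the class name, and the assumption that the sleep doubles each attempt and is capped at 20 s, come from the description above, not from SeaTunnel's actual `RetryUtils` implementation:

    ```java
    // Illustrative only: models a doubling backoff capped at maxMs.
    public class BackoffSchedule {
        // attempt is 1-based: attempt 1 sleeps baseMs, attempt 2 sleeps 2*baseMs, ...
        static long sleepForAttempt(int attempt, long baseMs, long maxMs) {
            long uncapped = baseMs << (attempt - 1); // double on each attempt
            return Math.min(uncapped, maxMs);
        }

        public static void main(String[] args) {
            for (int attempt = 1; attempt <= 5; attempt++) {
                System.out.printf("attempt %d: sleep %d ms%n",
                        attempt, sleepForAttempt(attempt, 2000, 20000));
            }
        }
    }
    ```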
   
   **Alternatively**, a different retry strategy can be used for this specific scenario:
   ```java
   // Add in Constant
   public static final int IMAP_LOADING_RETRY_TIME = 30;
   public static final int IMAP_LOADING_RETRY_SLEEP_INITIAL = 500;  // 500ms
   public static final int IMAP_LOADING_RETRY_SLEEP_MAX = 5000;     // 5s
   
   // In CoordinatorService
   jobState = RetryUtils.retryWithException(
       () -> runningJobStateIMap.get(jobId),
       new RetryUtils.RetryMaterial(
           Constant.IMAP_LOADING_RETRY_TIME,
           true,
           ExceptionUtil::isOperationNeedRetryException,
           Constant.IMAP_LOADING_RETRY_SLEEP_INITIAL,
           true));  // Enable exponential backoff
   ```
   
   **Rationale**: Exponential backoff is a best practice for handling transient failures, balancing recovery speed with system load.
   
   ---
   
   ### Issue 7: Potential race condition - duplicate recovery check not robust enough

   **Location**: `seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/CoordinatorService.java:481-485`

   **Related Context**:
   - Method: `restoreAllRunningJobFromMasterNodeSwitch`
   - Call chain: `checkNewActiveMaster` -> `initCoordinatorService` -> `restoreAllRunningJobFromMasterNodeSwitch`

   **Problem Description**:
   `restoreAllRunningJobFromMasterNodeSwitch` contains the following check:
    ```java
    // skip the job new submit
    if (!runningJobMasterMap.containsKey(entry.getKey())) {
        restoreJobFromMasterActiveSwitch(entry.getKey(), entry.getValue());
    }
    ```

   This check is meant to avoid restoring the same job twice, but it runs inside `CompletableFuture.runAsync`, which leaves a timing window:

   1. **Timing**: thread A checks `!runningJobMasterMap.containsKey(jobId)`, sees `true`, and prepares to restore
   2. **Concurrency**: at the same time, thread B may be submitting the same job as a new job (via `submitJob`)
   3. **Problem**: both threads may process the same jobId concurrently, causing a conflict
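
   This window is the classic check-then-act race. A minimal, standalone sketch (plain `ConcurrentHashMap` with illustrative names, not SeaTunnel's actual types) showing why `containsKey` + `put` is racy while `putIfAbsent` is atomic:

    ```java
    import java.util.concurrent.ConcurrentHashMap;

    public class CheckThenActDemo {
        // Returns the value that ends up stored for a contested key.
        static String contest() {
            ConcurrentHashMap<Long, String> map = new ConcurrentHashMap<>();
            // putIfAbsent performs the check and the insert as ONE atomic step,
            // so exactly one caller observes null and "wins" the key.
            String first = map.putIfAbsent(2L, "restored");   // null: this call claimed the key
            String second = map.putIfAbsent(2L, "submitted"); // "restored": key already claimed
            return map.get(2L);
        }

        public static void main(String[] args) {
            ConcurrentHashMap<Long, String> map = new ConcurrentHashMap<>();
            // Racy pattern: the check and the put are TWO separate operations;
            // another thread can insert the key between them.
            if (!map.containsKey(1L)) {
                map.put(1L, "restored");
            }
            System.out.println(contest());
        }
    }
    ```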
   
   **Potential Risks**:
   - **Risk 1**: The same job may be restored twice
   - **Risk 2**: A newly submitted job may be unexpectedly overwritten
   - **Risk 3**: The probability is low, but it can happen under high concurrency

   **Impact Scope**:
   - **Direct Impact**: Concurrency safety of `restoreJobFromMasterActiveSwitch` and `submitJob`
   - **Indirect Impact**: Job consistency
   - **Impact Area**: Single module

   **Severity**: MAJOR

   **Improvement Suggestions**:

   Option 1: double-check `runningJobMasterMap` inside the restore method
    ```java
    private void restoreJobFromMasterActiveSwitch(@NonNull Long jobId, @NonNull JobInfo jobInfo) {
        // Double-check: confirm again that the job has not been restored yet
        if (runningJobMasterMap.containsKey(jobId)) {
            logger.info(String.format("Job %s already restored, skipping", jobId));
            return;
        }

        // ... continue recovery logic
    }
    ```
   
   Option 2: use `ConcurrentHashMap.putIfAbsent`
    ```java
    private void restoreAllRunningJobFromMasterNodeSwitch() {
        // ...
        List<CompletableFuture<Void>> collect =
            needRestoreFromMasterNodeSwitchJobs.stream()
                .filter(entry -> !runningJobMasterMap.containsKey(entry.getKey()))
                .map(entry -> {
                    // Try to mark the job as "restoring"
                    JobMaster existing = runningJobMasterMap.putIfAbsent(
                        entry.getKey(),
                        RESTORE_MARKER  // a special marker object
                    );

                    if (existing != null) {
                        // Already handled by another thread
                        return CompletableFuture.<Void>completedFuture(null);
                    }

                    return CompletableFuture.runAsync(() -> {
                        try {
                            logger.info(String.format("begin restore job (%s)...", entry.getKey()));
                            restoreJobFromMasterActiveSwitch(entry.getKey(), entry.getValue());
                        } finally {
                            // After recovery completes, RESTORE_MARKER is replaced by the actual JobMaster
                            // If recovery fails, the marker must be cleaned up
                        }
                    }, executorService);
                })
                .collect(Collectors.toList());
        // ...
    }
    ```
   
   Option 3: restore synchronously (simple, but may affect performance)
    ```java
    private void restoreAllRunningJobFromMasterNodeSwitch() {
        // ...
        for (Map.Entry<Long, JobInfo> entry : needRestoreFromMasterNodeSwitchJobs) {
            if (!runningJobMasterMap.containsKey(entry.getKey())) {
                try {
                    restoreJobFromMasterActiveSwitch(entry.getKey(), entry.getValue());
                } catch (Exception e) {
                    logger.severe(e);
                }
            }
        }
        // ...
    }
    ```
   ```
   
   **Rationale**: Although the probability of this race condition is low, edge cases in distributed systems do happen and should be prevented in advance.
   
   ---
   
   ### Issue 8: Log level and content can be improved

   **Location**: `seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/CoordinatorService.java:475-493`

   **Related Context**:
   - Method: `restoreAllRunningJobFromMasterNodeSwitch`
   - Related logs: line 475, 477, 488, 490-493

   **Problem Description**:
   The current logging has the following problems:

   1. **Missing key information**: the logs do not record key metrics such as the retry count or the actual wait time
   2. **Inconsistent levels**: `logger.severe(e)` is used without distinguishing the severity of different errors
   3. **Missing diagnostic information**: when IMap loading fails, the IMap state, S3 connection state, etc. are not recorded
   
   **Potential Risks**:
   - **Risk 1**: Hard to diagnose production issues
   - **Risk 2**: The health of job recovery cannot be monitored
   - **Risk 3**: Observability data is missing

   **Impact Scope**:
   - **Direct Impact**: Operations and debugging experience
   - **Indirect Impact**: Efficiency of problem diagnosis
   - **Impact Area**: Single module

   **Severity**: MINOR

   **Improvement Suggestions**:
    ```java
    private void restoreAllRunningJobFromMasterNodeSwitch() {
        long startTime = System.currentTimeMillis();
        List<Map.Entry<Long, JobInfo>> needRestoreFromMasterNodeSwitchJobs =
                runningJobInfoIMap.entrySet().stream()
                        .filter(entry -> !runningJobMasterMap.containsKey(entry.getKey()))
                        .collect(Collectors.toList());

        if (needRestoreFromMasterNodeSwitchJobs.isEmpty()) {
            logger.info("No jobs to restore after master switch");
            return;
        }

        logger.info(String.format(
            "Found %d jobs to restore after master switch",
            needRestoreFromMasterNodeSwitchJobs.size()));

        // ... worker wait logic ...

        List<CompletableFuture<Void>> collect =
                needRestoreFromMasterNodeSwitchJobs.stream()
                        .map(entry -> CompletableFuture.runAsync(
                                () -> {
                                    long jobRestoreStartTime = System.currentTimeMillis();
                                    logger.info(String.format(
                                            "Begin restore job %s from master active switch",
                                            entry.getKey()));
                                    try {
                                        if (!runningJobMasterMap.containsKey(entry.getKey())) {
                                            restoreJobFromMasterActiveSwitch(
                                                    entry.getKey(), entry.getValue());
                                            long duration =
                                                    System.currentTimeMillis() - jobRestoreStartTime;
                                            logger.info(String.format(
                                                    "Job %s restored successfully in %d ms",
                                                    entry.getKey(), duration));
                                        } else {
                                            logger.info(String.format(
                                                    "Job %s already restored by another thread, skipping",
                                                    entry.getKey()));
                                        }
                                    } catch (Exception e) {
                                        long duration =
                                                System.currentTimeMillis() - jobRestoreStartTime;
                                        logger.severe(String.format(
                                                "Job %s restore failed after %d ms: %s",
                                                entry.getKey(), duration,
                                                ExceptionUtils.getMessage(e)), e);
                                    }
                                },
                                MDCTracer.tracing(entry.getKey(), executorService)))
                        .collect(Collectors.toList());

        try {
            CompletableFuture<Void> voidCompletableFuture =
                    CompletableFuture.allOf(collect.toArray(new CompletableFuture[0]));
            voidCompletableFuture.get();
            long totalDuration = System.currentTimeMillis() - startTime;
            logger.info(String.format(
                    "All job restores completed in %d ms. Success: %d, Failed: %d",
                    totalDuration,
                    needRestoreFromMasterNodeSwitchJobs.size(),
                    // TODO: a failure count can be added
                    0));
        } catch (Exception e) {
            long totalDuration = System.currentTimeMillis() - startTime;
            logger.severe(String.format(
                    "Job restore process failed after %d ms: %s",
                    totalDuration, ExceptionUtils.getMessage(e)), e);
            throw new SeaTunnelEngineException(e);
        }
    }
    ```
   
   **Rationale**: Better logging is the foundation of observability in 
distributed systems and is crucial for quickly locating and resolving issues.
   
   ---

