[I] [Bug][Zeta] Job restore fails silently after master switch due to RetryableHazelcastException not being retried by SeaTunnel [seatunnel]

via GitHub Wed, 04 Mar 2026 21:40:14 -0800


ricky2129 opened a new issue, #10560:
URL: https://github.com/apache/seatunnel/issues/10560


   SeaTunnel version: 2.3.12
   
   Environment: Separated cluster mode on Kubernetes, 2 masters, 4 workers, 
IMap storage on S3 (EAGER loading)
   
   Bug description:
    During a master active switch, 'restoreAllRunningJobFromMasterNodeSwitch'
fires for all running jobs concurrently while Hazelcast is still EAGERLY
loading the 'engine*' IMaps from S3.
   
   Two failure points:
   
   1. putIfAbsent in IMapCheckpointIDCounter.start(): 'checkIfLoaded()' throws
'RetryableHazelcastException'. The Hazelcast invocation framework retries up to
'hazelcast.invocation.max.retry.count' times (default 250, commonly configured
as 50) and then throws to SeaTunnel. 'RetryUtils.retryWithException' receives
the exception, but 'ExceptionUtil.isOperationNeedRetryException' does NOT
recognize 'RetryableHazelcastException', so it rethrows immediately.
SeaTunnel's own 30-retry window is never used.
   
   2. get() in restoreJobFromMasterActiveSwitch: the null check
'runningJobStateIMap.get(jobId)' also calls 'checkIfLoaded()' during EAGER
loading. This call is not wrapped in RetryUtils at all.
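A minimal sketch of what wrapping the second call in a retry loop could look like. Everything here is hypothetical and for illustration only: `getWithRetry` is not SeaTunnel's RetryUtils API, `LoadingMap` only simulates an IMap that is still loading, and the nested `RetryableHazelcastException` is a stand-in for the real Hazelcast class so the example compiles without Hazelcast on the classpath.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

final class RetryGetSketch {
    // Stand-in for Hazelcast's RetryableHazelcastException (assumption for this sketch).
    static class RetryableHazelcastException extends RuntimeException {}

    // Hypothetical helper: retries the read while the predicate accepts the
    // exception, sleeping between attempts, and rethrows once attempts run out.
    static <T> T getWithRetry(Supplier<T> read, int maxAttempts, long sleepMillis,
                              Predicate<RuntimeException> retryable) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return read.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to retry", ie);
            }
        }
    }

    // Simulates a map whose reads fail a fixed number of times while it is loading.
    static final class LoadingMap {
        private int failuresLeft;

        LoadingMap(int failuresBeforeLoaded) {
            this.failuresLeft = failuresBeforeLoaded;
        }

        String get(long jobId) {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new RetryableHazelcastException();
            }
            return "RUNNING";
        }
    }
}
```

With such a wrapper, a call like `getWithRetry(() -> map.get(jobId), 30, 2000L, e -> e instanceof RetryableHazelcastException)` would keep waiting for the load to finish instead of failing job init on the first exception; the attempt count and sleep interval shown are placeholders, not SeaTunnel's configured values.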
   
   Result: Jobs fail initialization silently. They remain RUNNING in the IMap
on S3 (not permanently lost) but are invisible in the UI and REST API because
they never register in 'runningJobMasterMap'.
   
   Root cause: 'ExceptionUtil.isOperationNeedRetryException' is missing
'RetryableHazelcastException':
     return exception instanceof HazelcastInstanceNotActiveException
         || exception instanceof InterruptedException
         || exception instanceof OperationTimeoutException;
         // RetryableHazelcastException is missing, although it is
         // explicitly designed to be retried
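One possible shape of the fix, as a sketch rather than the actual patch: the predicate keeps the three checks quoted above and adds a fourth for 'RetryableHazelcastException'. The nested exception classes are stand-ins so the example compiles on its own; in SeaTunnel the real Hazelcast classes would be used, and the real method may differ in signature or unwrap causes.

```java
final class RetryPredicateSketch {
    // Stand-ins for the Hazelcast exception types named in the report
    // (assumption: declared here only so the sketch is self-contained).
    static class HazelcastInstanceNotActiveException extends RuntimeException {}
    static class OperationTimeoutException extends RuntimeException {}
    static class RetryableHazelcastException extends RuntimeException {}

    // Hypothetical revised predicate: the original three checks plus
    // RetryableHazelcastException.
    static boolean isOperationNeedRetryException(Throwable exception) {
        return exception instanceof HazelcastInstanceNotActiveException
                || exception instanceof InterruptedException
                || exception instanceof OperationTimeoutException
                || exception instanceof RetryableHazelcastException;
    }
}
```

Because the check uses `instanceof`, any subclass of 'RetryableHazelcastException' would also be treated as retryable, which may or may not be desirable for every caller of the predicate.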
   
     Steps to reproduce
   
     1. Deploy SeaTunnel in separated cluster mode with S3 IMap storage (EAGER)
     2. Submit 30+ jobs (37 in the reported case)
     3. Trigger master pod restart/rollout
     4. Observe jobs missing from the UI, with 'Job id XXX init failed' in the master logs
   
     Expected behavior
   
     Jobs restore successfully after master switch regardless of S3 IMap 
loading speed
   
     Actual behavior
   
     Jobs fail to restore when IMap loading takes longer than
'hazelcast.invocation.max.retry.count' × 500ms
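To make that threshold concrete, the sketch below computes the window for the two retry counts mentioned in this report (50 and the default 250), using the 500ms pause stated above. The class and method names are made up for illustration.

```java
import java.time.Duration;

final class RetryWindowSketch {
    // Pause between Hazelcast invocation retries, taken from the 500ms figure in this report.
    static final Duration RETRY_PAUSE = Duration.ofMillis(500);

    // Total time Hazelcast keeps retrying before the exception reaches SeaTunnel.
    static Duration retryWindow(int maxRetryCount) {
        return RETRY_PAUSE.multipliedBy(maxRetryCount);
    }

    public static void main(String[] args) {
        System.out.println(retryWindow(50).getSeconds());   // prints 25
        System.out.println(retryWindow(250).getSeconds());  // prints 125
    }
}
```

So with the count configured as 50, any EAGER load from S3 that takes longer than about 25 seconds would trigger the failure; with the default of 250 the window is about 125 seconds.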


