ricky2129 opened a new issue, #10560:
URL: https://github.com/apache/seatunnel/issues/10560
SeaTunnel version: 2.3.12
Environment: Separated cluster mode on Kubernetes, 2 masters, 4 workers,
IMap storage on S3 (EAGER loading)
Bug description:
During a master active switch, 'restoreAllRunningJobFromMasterNodeSwitch'
fires for all running jobs concurrently while Hazelcast is still EAGERLY
loading 'engine*' IMaps from S3.
Two failure points:
1. putIfAbsent in IMapCheckpointIDCounter.start(): 'checkIfLoaded()' throws
'RetryableHazelcastException'. The Hazelcast invocation framework retries
'hazelcast.invocation.max.retry.count' times (default 250, commonly configured
as 50), then rethrows to SeaTunnel. 'RetryUtils.retryWithException' receives it,
but 'ExceptionUtil.isOperationNeedRetryException' does NOT include
'RetryableHazelcastException', so it throws immediately. SeaTunnel's own
30-retry window is never used.
2. get() in the restoreJobFromMasterActiveSwitch null check:
'runningJobStateIMap.get(jobId)' also calls 'checkIfLoaded()' during EAGER
loading. This call is not wrapped in RetryUtils at all.
Result: Jobs fail initialization silently. They remain RUNNING in the IMap on
S3 (not permanently lost) but are invisible in the UI and REST API because
they never register in 'runningJobMasterMap'.
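Failure point 2 could be handled by guarding the read with a bounded retry. The following is a minimal sketch of that idea, not SeaTunnel's actual code: 'GuardedMapRead' and 'readWithRetry' are hypothetical names, and 'RetryableHazelcastException' is declared here as a stand-in so the example compiles without Hazelcast on the classpath.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Stand-in for com.hazelcast.spi.exception.RetryableHazelcastException,
// declared locally only so this sketch is self-contained.
class RetryableHazelcastException extends RuntimeException {
    RetryableHazelcastException(String message) {
        super(message);
    }
}

// Hypothetical helper: retries a read while the failure is considered retryable.
final class GuardedMapRead {
    private GuardedMapRead() {}

    static <T> T readWithRetry(
            Supplier<T> read,
            int maxAttempts,
            long sleepMillis,
            Predicate<Throwable> retryable) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.get();
            } catch (RuntimeException e) {
                if (!retryable.test(e)) {
                    throw e; // non-retryable failures propagate immediately
                }
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while retrying", ie);
                    }
                }
            }
        }
        throw last; // all attempts failed with a retryable exception
    }
}
```

In the restore path, the supplier would be something like '() -> runningJobStateIMap.get(jobId)', with the retry count and sleep chosen to outlast the expected IMap load time.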
Root cause: 'ExceptionUtil.isOperationNeedRetryException' is missing
'RetryableHazelcastException':
    return exception instanceof HazelcastInstanceNotActiveException
            || exception instanceof InterruptedException
            || exception instanceof OperationTimeoutException;
    // RetryableHazelcastException is missing, although it is explicitly
    // designed to be retried
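One possible shape of the fix is to add the missing type to the predicate. The sketch below assumes the method takes a 'Throwable' and returns a boolean; the real signature in 'ExceptionUtil' may differ (for example, it may unwrap the root cause first). The exception classes are local stand-ins for the Hazelcast types, and 'RetryPredicateSketch' is a hypothetical name.

```java
// Stand-ins for the Hazelcast exception types, declared locally so the sketch
// compiles without Hazelcast on the classpath.
class HazelcastInstanceNotActiveException extends IllegalStateException {}

class OperationTimeoutException extends RuntimeException {}

class RetryableHazelcastException extends RuntimeException {}

// Hypothetical holder class; in SeaTunnel the method lives in ExceptionUtil.
final class RetryPredicateSketch {
    private RetryPredicateSketch() {}

    static boolean isOperationNeedRetryException(Throwable exception) {
        return exception instanceof HazelcastInstanceNotActiveException
                || exception instanceof InterruptedException
                || exception instanceof OperationTimeoutException
                // Proposed addition: thrown while an IMap is still loading.
                || exception instanceof RetryableHazelcastException;
    }
}
```

With this change, 'RetryUtils.retryWithException' would treat the exception as retryable and use its own retry window instead of failing on the first attempt.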
Steps to reproduce
1. Deploy SeaTunnel in separated cluster mode with S3 IMap storage (EAGER)
2. Submit 30+ jobs (37 in the reported case)
3. Trigger master pod restart/rollout
4. Observe jobs missing from the UI, with 'Job id XXX init failed' in the master logs
Expected behavior
Jobs restore successfully after master switch regardless of S3 IMap
loading speed
Actual behavior
Jobs fail to restore when IMap loading takes longer than
'hazelcast.invocation.max.retry.count' × 500 ms
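To make the threshold concrete, the formula above can be evaluated for the two retry counts mentioned in the description. This is a rough estimate only, assuming a fixed 500 ms pause between retries as stated in the report; 'RetryWindow' and 'windowMillis' are hypothetical names used for illustration.

```java
// Hypothetical helper that evaluates: retry count x pause per retry.
final class RetryWindow {
    private RetryWindow() {}

    static long windowMillis(int maxRetryCount, long pauseMillis) {
        // multiplyExact guards against silent overflow on extreme inputs.
        return Math.multiplyExact((long) maxRetryCount, pauseMillis);
    }
}
```

Under that assumption, 'RetryWindow.windowMillis(250, 500)' gives 125000 ms (about 125 s) for the default count, and 'RetryWindow.windowMillis(50, 500)' gives 25000 ms (25 s) for a count of 50. Any EAGER load from S3 that takes longer than this would exhaust the Hazelcast-level retries.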
--
This is an automated message from the Apache Git Service.