DanielCarter-stack commented on issue #10409: URL: https://github.com/apache/seatunnel/issues/10409#issuecomment-3815781132
<!-- code-pr-reviewer -->
Thanks for reporting. Based on the stack traces and the code, this appears to be two related stability issues under high load:

1. **OperationTimeoutException**: `GetJobStatusOperation.run()` blocks on `future.get()` without a timeout. When the `get_job_status_operation` executor is saturated, or `getJobStatus()` is slow (it accesses the JobMaster/IMap in `CoordinatorService.getJobStatus()`), the operation exceeds the 25s Hazelcast call-timeout.
2. **NullPointerException**: `CheckpointCoordinator.completePendingCheckpoint()` calls `completedCheckpoint.getCheckpointType()` at line 913, which suggests that `completedCheckpoint` is null there. The `whenCompleteAsync` callback already has a null check (lines 653-656 in `CheckpointCoordinator`), so this may indicate a concurrent modification issue or a version mismatch.

**To help us reproduce/fix**, could you provide:

- Cluster size and number of concurrent jobs?
- Your `hazelcast.yaml` configuration (especially `operation.call-timeout`)?
- Whether this occurs during checkpoint triggers or during specific REST API calls?
- Full version details for `2.3.12.hb-SNAPSHOT`?

**Workaround**: Consider increasing `hazelcast.operation.call-timeout` (default 25000ms) in your configuration.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
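
For illustration only, the unbounded `future.get()` described in item 1 can be contrasted with a bounded wait. The sketch below is not SeaTunnel code: the class `BoundedStatusLookup`, the helper `awaitStatus`, and the `"UNKNOWN"` fallback are hypothetical names chosen for this example, and it only assumes standard `java.util.concurrent` behavior.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical example class; not part of SeaTunnel or Hazelcast.
public class BoundedStatusLookup {

    // Waits at most timeoutMillis for the future; returns the fallback
    // instead of blocking the calling thread indefinitely.
    static String awaitStatus(CompletableFuture<String> future,
                              long timeoutMillis,
                              String fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Give up on the slow lookup so the caller can answer in time.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for callers further up the stack.
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("status lookup failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // A lookup that has already completed returns its value.
        CompletableFuture<String> fast = CompletableFuture.completedFuture("RUNNING");
        System.out.println(awaitStatus(fast, 100, "UNKNOWN"));

        // A lookup that never completes falls back after the timeout.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println(awaitStatus(stuck, 100, "UNKNOWN"));
    }
}
```

In a real operation the local timeout would presumably need to stay below the Hazelcast call timeout so that the caller receives a response rather than an `OperationTimeoutException`; whether a fallback status or an explicit error is the right response is a design decision for the maintainers.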
