dybyte commented on PR #10075:
URL: https://github.com/apache/seatunnel/pull/10075#issuecomment-3637302136
> I have a question about the behavior when a job is already stuck in the
DOING_SAVEPOINT state. In that case, can stopPipelineWithCheckpointFallback
always successfully stop the job and release the slot resources, or are there
still situations where the job may remain stuck in DOING_SAVEPOINT?
>
> ```
> if (jobMaster.getCheckpointManager().isCompletedPipeline(pipelineId)) {
>     forcePipelineFinish();
> }
> ```
>
> Conceptually, what we wanted here is a “force pause” of the job. But in
the current implementation, the force option seems to force-end the job (e.g.,
set it to CANCELED) instead of pausing it. From your point of view, does a
forced termination really count as a “pause”?
>
> @dybyte
From my understanding, the main purpose of this feature is to forcefully
terminate a job that is stuck in a certain state, so that it does not continue
holding slot resources indefinitely.
For that reason, the implementation focuses on ending the job rather than
pausing it.
Regarding the job being stuck in the `DOING_SAVEPOINT` state, the reporter
did not provide detailed logs, so it’s difficult to identify the exact root
cause. My assumption is that it may be due to an issue during the
savepoint-writing process, or to the termination signal not being received
correctly.
Except for extreme cases such as deadlocks, I believe the current logic
should be able to successfully terminate the job and release its slot resources.
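
As a rough illustration of the intent only (this is not the code in the PR, and
every class, method, and field name below except the `DOING_SAVEPOINT` and
`CANCELED` state names is hypothetical), the idea is to wait for the savepoint
for a bounded time and, if it does not finish, end the pipeline and release its
slots instead of waiting forever:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ForceStopSketch {
    // Simplified states for illustration; not the engine's real state machine.
    enum PipelineState { RUNNING, DOING_SAVEPOINT, SAVEPOINT_DONE, CANCELED }

    /** Hypothetical stand-in for a pipeline; not a SeaTunnel class. */
    static class Pipeline {
        volatile PipelineState state = PipelineState.RUNNING;
        volatile boolean slotsHeld = true;

        // Forced path: end the pipeline and give the slots back.
        void cancelAndReleaseSlots() {
            state = PipelineState.CANCELED;
            slotsHeld = false;
        }
    }

    /**
     * Waits up to timeoutMillis for the savepoint future. If it completes,
     * the pipeline stops normally; if it times out or fails, the pipeline
     * is force-ended so it cannot hold slots indefinitely.
     */
    static PipelineState stopWithFallback(
            Pipeline p, CompletableFuture<Void> savepoint, long timeoutMillis) {
        p.state = PipelineState.DOING_SAVEPOINT;
        try {
            savepoint.get(timeoutMillis, TimeUnit.MILLISECONDS);
            p.state = PipelineState.SAVEPOINT_DONE;
            p.slotsHeld = false;
        } catch (TimeoutException | ExecutionException e) {
            // Savepoint hung or failed: fall back to forced termination.
            p.cancelAndReleaseSlots();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            p.cancelAndReleaseSlots();
        }
        return p.state;
    }

    public static void main(String[] args) {
        Pipeline stuck = new Pipeline();
        // A future that never completes models a savepoint that hangs.
        PipelineState result =
                stopWithFallback(stuck, new CompletableFuture<>(), 100);
        System.out.println(result + " slotsHeld=" + stuck.slotsHeld);
    }
}
```

Note that the sketch assumes the forced path itself cannot block; a deadlock
inside the cancel path is exactly the kind of extreme case mentioned above that
this approach would not cover.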
Please let me know if there is anything I might have overlooked. Thank you!