wolfboys commented on issue #3240:
URL:
https://github.com/apache/incubator-streampark/issues/3240#issuecomment-1763400260
Thx for your proposal! 🎉 Tracking the status of the cluster is crucial, so
let's synchronize some info first.
LOST status: The watcher sends an HTTP request to the Flink job (Flink
WebUI) or the cluster (Flink cluster | Yarn cluster) but receives no
response, for example because of network issues or high machine load. The
job/cluster might still be running normally, or it could have stopped. In
this case, we can't determine the running status and have to set it to LOST.
Now, let's discuss monitoring and automatic detection for jobs in the LOST
state. I have a few suggestions for this:
1. In the FlinkAppHttpWatcher, when a job or cluster is detected as LOST (no
response from the HTTP request), I suggest not immediately triggering an alert
or notifying the user. We can simply mark the job or cluster as LOST without
taking further action.
2. In the FlinkAppLostWatcher, re-send the HTTP request for each job or
cluster in the LOST state. If a response is received, we should update the
job status accordingly:
a. If the job is still running, there's no need to notify the user.
b. If the job has failed or been canceled, we should notify the user.
3. In the FlinkAppLostWatcher, if the job (cluster) status is still LOST
after re-sending the HTTP request, we should continue retrying. However, once
the number of retries reaches a certain threshold, we can consider the job
truly lost and stop retrying. At this point, we need to notify the user.
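
To make the three suggestions concrete, here is a rough sketch of the decision logic. It is only an illustration under assumptions: LostJobTracker, ProbeResult, Action, and the maxRetries threshold are hypothetical names and are not existing StreamPark classes. The real watchers would also have to perform the HTTP request and persist the status.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the LOST handling described above; not StreamPark code. */
public class LostJobTracker {

    /** Outcome of re-sending the HTTP request for a LOST job or cluster. */
    public enum ProbeResult { RUNNING, FAILED, CANCELED, NO_RESPONSE }

    /** What the watcher should do next. */
    public enum Action { NONE, KEEP_RETRYING, NOTIFY_USER }

    private final int maxRetries; // assumed configurable threshold
    private final Map<String, Integer> retries = new HashMap<>();

    public LostJobTracker(int maxRetries) {
        if (maxRetries < 1) {
            throw new IllegalArgumentException("maxRetries must be >= 1");
        }
        this.maxRetries = maxRetries;
    }

    /** Suggestion 1: mark the job as LOST, but do not alert yet. */
    public Action markLost(String jobId) {
        retries.putIfAbsent(jobId, 0);
        return Action.NONE;
    }

    /** Suggestions 2 and 3: react to the result of a re-sent request. */
    public Action onProbe(String jobId, ProbeResult result) {
        switch (result) {
            case RUNNING:
                // 2a: the job is alive again, so stop tracking and stay silent.
                retries.remove(jobId);
                return Action.NONE;
            case FAILED:
            case CANCELED:
                // 2b: the job has really ended, so tell the user.
                retries.remove(jobId);
                return Action.NOTIFY_USER;
            default:
                // 3: still no response; count the attempt and give up at the threshold.
                int attempts = retries.merge(jobId, 1, Integer::sum);
                if (attempts >= maxRetries) {
                    retries.remove(jobId);
                    return Action.NOTIFY_USER;
                }
                return Action.KEEP_RETRYING;
        }
    }

    /** True while the job is still being retried. */
    public boolean isTracked(String jobId) {
        return retries.containsKey(jobId);
    }
}
```

With a threshold of 3, a job that never answers yields KEEP_RETRYING twice and NOTIFY_USER on the third attempt, while a job that answers as running is dropped from tracking without any notification. Keeping the counter per job means a short network blip on one job does not affect the alerting of another.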