xujiangfeng001 commented on issue #3240: URL: https://github.com/apache/incubator-streampark/issues/3240#issuecomment-1764164160
> Thx for your proposal! 🎉 Tracking the status of the cluster is crucial, so let's synchronize some info first.
>
> LOST status: the watcher sends an HTTP request to the Flink job (Flink WebUI) or the cluster (Flink cluster | Yarn cluster) but receives no response because of network issues or high machine load. The job or cluster might still be running normally, or it might have stopped. In this case we can't determine the running status and have to set it to LOST.
>
> Now, let's discuss monitoring and automatic detection for jobs in the LOST state. I have a few suggestions:
>
> 1. In the FlinkAppHttpWatcher, when a job or cluster is detected as LOST (no response to the HTTP request), I suggest not triggering an alert or notifying the user immediately. We can simply mark the job or cluster as LOST without taking further action.
> 2. In the FlinkAppLostWatcher, re-send the HTTP request for each job or cluster in the LOST state. If a response is received, update the job status accordingly:
>    a. If the job is still running, there's no need to notify the user.
>    b. If the job has failed or been canceled, we should notify the user.
> 3. In the FlinkAppLostWatcher, if the job (or cluster) status is still LOST after the re-request, we should keep retrying. However, once the number of retries reaches a certain threshold, we can consider the job truly lost and stop retrying. At this point, we need to notify the user.

Thank you for your suggestions, this looks very promising. I will incorporate this logic into the development process.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
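
For reference, the re-probe flow described in points 2 and 3 of the quoted comment could be sketched roughly as below. This is only an illustration of the decision logic, not StreamPark's actual implementation: the class `LostStateRetryPolicy`, the enums `State` and `Action`, the method `onProbe`, and the `maxRetries` parameter are all hypothetical names, and the real FlinkAppLostWatcher may track retries quite differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the LOST-state retry policy; not StreamPark code. */
class LostStateRetryPolicy {

    /** Simplified job states; LOST means the HTTP probe got no response. */
    enum State { RUNNING, FAILED, CANCELED, LOST }

    /** What the watcher should do after one re-probe of a LOST job. */
    enum Action { NONE, RETRY_LATER, NOTIFY_USER }

    // Assumed threshold after which a job is considered truly lost.
    private final int maxRetries;

    // Number of consecutive unanswered re-probes per job id.
    private final Map<String, Integer> retryCounts = new HashMap<>();

    LostStateRetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Decides the next action from the result of re-probing a LOST job. */
    Action onProbe(String jobId, State probed) {
        switch (probed) {
            case RUNNING:
                // Job recovered: clear the counter, no notification needed.
                retryCounts.remove(jobId);
                return Action.NONE;
            case FAILED:
            case CANCELED:
                // Status is now known and terminal: tell the user.
                retryCounts.remove(jobId);
                return Action.NOTIFY_USER;
            default:
                // Still LOST: count the attempt and compare with the threshold.
                int attempts = retryCounts.merge(jobId, 1, Integer::sum);
                if (attempts >= maxRetries) {
                    // Give up retrying and report the job as truly lost.
                    retryCounts.remove(jobId);
                    return Action.NOTIFY_USER;
                }
                return Action.RETRY_LATER;
        }
    }
}
```

With a threshold of 3, a job that stays unreachable yields `RETRY_LATER` on the first two re-probes and `NOTIFY_USER` on the third, while a job that answers as running at any point resets its counter without alerting anyone.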
