Re: [I] [Improve] auto-probe improvement [incubator-streampark]

via GitHub Mon, 16 Oct 2023 03:20:03 -0700


xujiangfeng001 commented on issue #3240:
URL: 
https://github.com/apache/incubator-streampark/issues/3240#issuecomment-1764164160


   > Thx for your proposal! 🎉 Tracking the status of the cluster is crucial, so 
let's synchronize some info first.
   > 
   > LOST status: Set when the watcher sends an HTTP request to the Flink job 
(Flink WebUI) or the cluster (Flink cluster | Yarn cluster) but doesn't receive 
a response due to network issues or high machine load. The job/cluster might 
still be running normally, or it could have stopped. In this case, we can't 
determine the running status and have to set it to LOST.
   > 
   > Now, let's discuss monitoring and automatic detection for jobs in the LOST 
state. I have a few suggestions for this:
   > 
   > 1. In the FlinkAppHttpWatcher, when a job or cluster is detected as LOST 
(no response from the HTTP request), I suggest not immediately triggering an 
alert or notifying the user. We can simply mark the job or cluster as LOST 
without taking further action.
   > 2. In the FlinkAppLostWatcher, re-send the HTTP request for each job or 
cluster in the LOST state. If a response is received, we should update the job 
status accordingly:
   >    a. If the job is still running, there's no need to notify the user.
   >    b. If the job has failed or been canceled, we should notify the user.
   > 3. In the FlinkAppLostWatcher, if the job (cluster) status is still LOST 
after re-sending the HTTP request, we should continue retrying. However, once 
the number of retries reaches a certain threshold, we can consider the job 
truly lost and stop retrying. At this point, we need to notify the user.
   
   Thank you for your suggestions; this looks very promising. I will 
incorporate this logic into the development process.
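
   The three quoted suggestions could be sketched roughly as below. This is 
only an illustration of the control flow, not StreamPark code: `Job`, `State`, 
`http_watch`, `lost_watch`, the `probe` and `notify` callbacks, and the 
`MAX_LOST_RETRIES` value are all hypothetical names and numbers assumed for 
the example. A `probe` that returns `None` stands in for "no HTTP response".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    RUNNING = "RUNNING"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    LOST = "LOST"


@dataclass
class Job:
    name: str
    state: State = State.RUNNING
    lost_retries: int = 0   # consecutive unanswered re-probes while LOST
    gave_up: bool = False   # True once the retry threshold was reached


# Assumed threshold; the thread only says "a certain threshold".
MAX_LOST_RETRIES = 3

Probe = Callable[[Job], Optional[State]]   # None means "no HTTP response"
Notify = Callable[[str], None]


def http_watch(job: Job, probe: Probe) -> None:
    """Suggestion 1: on no response, mark the job LOST and do not alert."""
    observed = probe(job)
    if observed is None:
        job.state = State.LOST
        job.lost_retries = 0
        job.gave_up = False
    else:
        # Regular status handling is outside the scope of this sketch.
        job.state = observed


def lost_watch(job: Job, probe: Probe, notify: Notify) -> None:
    """Suggestions 2 and 3: re-probe a LOST job and decide whether to alert."""
    if job.state is not State.LOST or job.gave_up:
        return
    observed = probe(job)
    if observed is None:
        # Suggestion 3: still LOST, keep retrying until the threshold.
        job.lost_retries += 1
        if job.lost_retries >= MAX_LOST_RETRIES:
            job.gave_up = True
            notify(f"{job.name}: still LOST after {job.lost_retries} retries")
        return
    # Suggestion 2: a response arrived, so update the status.
    job.state = observed
    job.lost_retries = 0
    if observed in (State.FAILED, State.CANCELED):
        notify(f"{job.name}: {observed.value}")   # 2b: alert the user
    # 2a: the job is running again, no notification needed
```

   Keeping the retry counter on the job itself means the lost watcher needs 
no extra bookkeeping, and the `gave_up` flag is one simple way to stop 
re-probing after the final alert has been sent.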


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
