Re: [I] [Improve] auto-probe improvement [incubator-streampark]

via GitHub Sun, 15 Oct 2023 07:07:35 -0700


wolfboys commented on issue #3240:
URL: 
https://github.com/apache/incubator-streampark/issues/3240#issuecomment-1763400260


   Thx for your proposal! 🎉 Tracking the status of the cluster is crucial, so 
let's synchronize some info first.
   
   LOST status: A job or cluster is in LOST status when the watcher sends an 
HTTP request to the Flink job (Flink WebUI) or the cluster (Flink cluster | 
Yarn cluster) but doesn't receive a response due to network issues or high 
machine load. The job/cluster might still be running normally, or it could 
have stopped. In this case, we can't determine the running status and have to 
set it to LOST.
   
   Now, let's discuss monitoring and automatic detection for jobs in the LOST 
state. I have a few suggestions for this:
   
   1. In the FlinkAppHttpWatcher, when a job or cluster is detected as LOST (no 
response from the HTTP request), I suggest not immediately triggering an alert 
or notifying the user. We can simply mark the job or cluster as LOST without 
taking further action.
   2. In the FlinkAppLostWatcher, re-send the HTTP request for each job or 
cluster in LOST status. If a response is received, we should update the job 
status accordingly:
      a. If the job is still running, there's no need to notify the user.
      b. If the job has failed or been canceled, we should notify the user.
   3. In the FlinkAppLostWatcher, if the job (cluster) status is still LOST 
after re-sending the HTTP request, we should continue retrying. However, once 
the number of retries reaches a certain threshold, we can consider the job 
truly lost and stop retrying. At this point, we need to notify the user.
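
The three steps above could be sketched roughly as follows. This is only an illustration in Python, not StreamPark's actual code: `LostWatcher`, `probe`, `notify`, `State`, and the `MAX_RETRIES` value are hypothetical names chosen for the example, and the real watchers would issue HTTP requests where `probe` is called.

```python
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    FAILED = "FAILED"
    CANCELED = "CANCELED"
    LOST = "LOST"


MAX_RETRIES = 3  # hypothetical threshold; the proposal does not fix a value


class LostWatcher:
    """Re-probes LOST jobs and decides when the user must be notified."""

    def __init__(self, probe, notify, max_retries=MAX_RETRIES):
        self.probe = probe      # probe(job_id) -> State, or None if no response
        self.notify = notify    # notify(job_id, state)
        self.max_retries = max_retries
        self.retries = {}       # job_id -> number of unanswered re-probes

    def mark_lost(self, job_id):
        # Step 1: record the job as LOST without alerting anyone.
        self.retries.setdefault(job_id, 0)

    def check(self, job_id):
        """Run one re-probe for a LOST job and return the resulting state."""
        state = self.probe(job_id)
        if state is None:
            # Step 3: still no response, so count this retry.
            self.retries[job_id] = self.retries.get(job_id, 0) + 1
            if self.retries[job_id] >= self.max_retries:
                # Threshold reached: stop retrying and notify the user.
                del self.retries[job_id]
                self.notify(job_id, State.LOST)
            return State.LOST
        # Step 2: a response arrived, so the job is no longer tracked as LOST.
        self.retries.pop(job_id, None)
        if state in (State.FAILED, State.CANCELED):
            self.notify(job_id, state)  # 2b: failed or canceled
        return state                    # 2a: still running, no notification
```

A scheduler would call `check` periodically for every job still present in `retries`. Keeping the retry counter per job means one unreachable job does not affect when alerts fire for the others.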
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
