gyfora commented on PR #686:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/686#issuecomment-1770625298

   > Hi @gyfora , thanks for fix this issue.
   > 
   > IIUC, the root cause of this problem is that Source Task has two threads:
   > 
   > 1. Fetcher thread: fetch data from external system into flink, and cache 
these data into a buffer or queue
   > 2. Source Task thread: consume data from cached buffer or queue, and 
process these data.
   > 
   > If one of these two operations is slow, it will cause job lag. Currently 
we are only compatible with the Source Task thread, not the fetcher thread, 
right?
   > 
   > I didn't read the code in detail. I have a question after reading the 
purpose and solution:
   > 
   > > Observed TPR can only be measured when we are catching up (during 
stabilization) or when cannot keep up.
   > 
   > I doubt this problem can be completely fixed if `Observed TPR` cannot be 
measured all the time. For example, `Fetcher thread` can fetch 10 records/s, 
and `Source Task thread` can process 100 records/s, and the input rate of kafka 
topic is 200 records/s.
   > 
   > The job doesn't have log when parallelism >= 20. Meanwhile, :
   > 
   > * The `BusyRatio` is very low (Each subtask process 10records/s, the 
BusyRatio = 10/100= 10%)
   > * And `Observed TPR` cannot be measured.
   > 
   > The autoscaler will scale down parallelism due to BusyRatio=10%. After 
scale down, the job has lag, so `Observed TPR` can be measured, then trigger 
scale up, right?
   > 
   > If so, the autoscaler will scale down and scale up over and over again.
   > 
   > Please correct me if my understanding is wrong, thanks~
   
   @1996fanrui 
   I think the scenario is mostly covered. In case we scale down due to the too 
low busy time, we will then start accumulating lag. If enough lag accumulated, 
the Observed TPR measurement will kick in and will soon rescale the job.
   
   After the job was rescaled it will have a backlog to catch up (we know this 
because we only triggered the measurements when there was already a lag before 
the restart) so we will have observed TPR measurements AFTER the restart as 
well. These measurements will last until the catch up is completed and the 
value will remain in the collected metrics subsequently so we won't scale down 
again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to