1996fanrui commented on PR #686:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/686#issuecomment-1770560740

   Hi @gyfora , thanks for fix this issue.
   
   IIUC, the root cause of this problem is that Source Task has two threads:
   1. Fetcher thread: fetch data from external system into flink, and cache 
these data into a buffer or queue
   2. Source Task thread: consume data from cached buffer or queue, and process 
these data.
   
   If one of these two operations is slow, it will cause job lag. Currently we 
are only compatible with the Source Task thread, not the fetcher thread, right?
   
   I didn't read the code in detail. I have a question after reading the 
purpose and solution:
   
   > Observed TPR can only be measured when we are catching up (during 
stabilization) or when cannot keep up.
   
   I doubt this problem can be completely fixed if `Observed TPR` cannot be 
measured all the time. For example, `Fetcher thread` can fetch 10 records/s, 
and `Source Task thread` can process 100 records/s, and the input rate of kafka 
topic is 200 records/s.
   
   The job doesn't have log when parallelism >= 20. Meanwhile, :
   - The `BusyRatio` is very low (Each subtask process 10records/s, the 
BusyRatio = 10/100= 10%)
   - And `Observed TPR` cannot be measured. 
   
   The autoscaler will scale down parallelism due to BusyRatio=10%. After scale 
down, the job has lag, so `Observed TPR` can be measured, then trigger scale 
up, right?
   
   If so, the autoscaler will scale down and scale up over and over again. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to