gyfora commented on PR #686: URL: https://github.com/apache/flink-kubernetes-operator/pull/686#issuecomment-1770625298
> Hi @gyfora , thanks for fix this issue. > > IIUC, the root cause of this problem is that Source Task has two threads: > > 1. Fetcher thread: fetch data from external system into flink, and cache these data into a buffer or queue > 2. Source Task thread: consume data from cached buffer or queue, and process these data. > > If one of these two operations is slow, it will cause job lag. Currently we are only compatible with the Source Task thread, not the fetcher thread, right? > > I didn't read the code in detail. I have a question after reading the purpose and solution: > > > Observed TPR can only be measured when we are catching up (during stabilization) or when cannot keep up. > > I doubt this problem can be completely fixed if `Observed TPR` cannot be measured all the time. For example, `Fetcher thread` can fetch 10 records/s, and `Source Task thread` can process 100 records/s, and the input rate of kafka topic is 200 records/s. > > The job doesn't have log when parallelism >= 20. Meanwhile, : > > * The `BusyRatio` is very low (Each subtask process 10records/s, the BusyRatio = 10/100= 10%) > * And `Observed TPR` cannot be measured. > > The autoscaler will scale down parallelism due to BusyRatio=10%. After scale down, the job has lag, so `Observed TPR` can be measured, then trigger scale up, right? > > If so, the autoscaler will scale down and scale up over and over again. > > Please correct me if my understanding is wrong, thanks~ @1996fanrui I think the scenario is mostly covered. In case we scale down due to the too low busy time, we will then start accumulating lag. If enough lag accumulated, the Observed TPR measurement will kick in and will soon rescale the job. After the job was rescaled it will have a backlog to catch up (we know this because we only triggered the measurements when there was already a lag before the restart) so we will have observed TPR measurements AFTER the restart as well. These measurements will last until the catch up is completed and the value will remain in the collected metrics subsequently so we won't scale down again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org