[ 
https://issues.apache.org/jira/browse/FLINK-33306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora closed FLINK-33306.
------------------------------
    Resolution: Fixed

merged to main cc680e142bb8d52c4db215658ee7f4c4159a0fe4

> Use observed true processing rate when source metrics are incorrect
> -------------------------------------------------------------------
>
>                 Key: FLINK-33306
>                 URL: https://issues.apache.org/jira/browse/FLINK-33306
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Assignee: Gyula Fora
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.7.0
>
>
> The aim is to address cases where Flink incorrectly reports low busy time 
> (high idleness) for sources that in fact cannot keep up because the 
> reader/fetchers are slow. As the metrics cannot generally be fixed on the 
> Flink / connector side, we have to detect this and handle it when 
> collecting the metrics.
> The main symptom of this problem is overestimation of the true processing 
> rate (TPR): scaling is not triggered even while lag is building up, because 
> the autoscaler thinks the source will be able to keep up.
> To tackle this we distinguish two methods of TPR measurement:
>  # *Busy-time based TPR* (this is the current approach in the autoscaler) : 
> computed from incoming records and busy time
>  # *Observed TPR* : computed from incoming records and back pressure, 
> measurable only when we can assume full processing throughput (i.e. during 
> catch-up)
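
A minimal sketch of the two measurements listed above, in Python for brevity. The formulas are an assumption based on the description (records per second divided by the fraction of time spent busy, or by the fraction of time not back pressured); the function names are hypothetical and this is not the operator's actual code. The inputs are meant to correspond to Flink's numRecordsInPerSecond, busyTimeMsPerSecond and backPressuredTimeMsPerSecond metrics.

```python
from typing import Optional


def busy_time_tpr(records_in_per_sec: float, busy_ms_per_sec: float) -> Optional[float]:
    """Busy-time based TPR: the rate the source would reach if busy 100% of the time."""
    if busy_ms_per_sec <= 0:
        return None  # no busy time reported, so the rate is undefined
    return records_in_per_sec / (busy_ms_per_sec / 1000.0)


def observed_tpr(records_in_per_sec: float, backpressured_ms_per_sec: float) -> Optional[float]:
    """Observed TPR: the rate achieved while not back pressured.

    Only meaningful while the source has lag, i.e. when it is assumed
    to be processing at full throughput.
    """
    unblocked_fraction = 1.0 - backpressured_ms_per_sec / 1000.0
    if unblocked_fraction <= 0:
        return None  # fully back pressured, nothing can be inferred
    return records_in_per_sec / unblocked_fraction


# A lagging source reads 1000 records/s, is not back pressured,
# but (incorrectly) reports only 100 ms/s of busy time.
print(busy_time_tpr(1000, 100))  # roughly 10x the rate actually observed
print(observed_tpr(1000, 0))     # the rate actually observed
```

In this made-up example the busy-time based value is about ten times the observed one, which is the kind of overestimation the issue describes.
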
> h3. Current behaviour
> The operator currently always uses a busy-time based TPR calculation which is 
> very flexible and allows for scaling up / down but is susceptible to 
> overestimation due to the broken metrics.
> h3. Suggested new behaviour
> Instead of using the busy-time based TPR we detect when TPR is overestimated 
> (busy-time too low) and switch to observed TPR.
> To do this, whenever there is lag for a source (during catch-up or 
> lag build-up) we measure both busy-time based and observed TPR.
> If the avg busy-time based TPR is off by more than a configured amount we 
> switch to observed TPR for this source during metric evaluation.
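
The switching rule above could look roughly like the following Python sketch. The relative comparison, the name `switch_threshold` and its default of 0.15 are assumptions made for illustration; the issue only says the busy-time based average must be off "by a configured amount".

```python
from typing import Optional


def select_tpr(
    avg_busy_time_tpr: float,
    avg_observed_tpr: Optional[float],
    switch_threshold: float = 0.15,  # hypothetical default, relative difference
) -> float:
    """Pick the TPR to use for a source during metric evaluation."""
    if avg_observed_tpr is None:
        # Observed TPR was not measurable (no lag), keep the busy-time value.
        return avg_busy_time_tpr
    if avg_busy_time_tpr > avg_observed_tpr * (1.0 + switch_threshold):
        # Busy-time based TPR overestimates capacity: trust what was observed.
        return avg_observed_tpr
    return avg_busy_time_tpr


print(select_tpr(10000.0, 1000.0))  # overestimated, switches to observed TPR
print(select_tpr(1050.0, 1000.0))   # within the threshold, keeps busy-time TPR
print(select_tpr(1050.0, None))     # no observation available, keeps busy-time TPR
```

Only overestimation triggers the switch in this sketch, since that is the failure mode that prevents scale-up while lag builds.
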
> *Why not use observed TPR all the time?*
> Observed TPR can only be measured when we are catching up (during 
> stabilization) or when the source cannot keep up. This makes it harder to 
> scale down or to detect changes in source throughput over time (before lag 
> starts to build up). Instead of always using observed TPR, we switch to it 
> only when we detect a problem with the busy-time (this is a rare case 
> overall), to hopefully get the best of both worlds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
