[
https://issues.apache.org/jira/browse/FLINK-33306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora updated FLINK-33306:
-------------------------------
Issue Type: New Feature (was: Improvement)
> Use observed true processing rate when source metrics are incorrect
> -------------------------------------------------------------------
>
> Key: FLINK-33306
> URL: https://issues.apache.org/jira/browse/FLINK-33306
> Project: Flink
> Issue Type: New Feature
> Components: Kubernetes Operator
> Reporter: Gyula Fora
> Assignee: Gyula Fora
> Priority: Critical
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.7.0
>
>
> The aim is to address the cases when Flink incorrectly reports low busy time
> (high idleness) for sources that are in fact cannot keep up due to the
> slowness of the reader/fetchers. As the metrics cannot be generally fixed on
> the Flink - connector side we have to detect this and handle it when
> collecting the metrics.
> The main symptom of this problem is overestimation of the true processing
> rate and not triggering scaling even if lag is building up as the autoscaler
> thinks it will be able to keep up.
> To tackle this we differentiate two different methods of TPR measurement:
> # *Busy-time based TPR* (this is the current approach in the autoscaler) :
> computed from incoming records and busy time
> # *Observed TPR* : computed from incoming records and back pressure,
> measurable only when we assume full processing throughput (i.e during
> catch-up)
> h3. Current behaviour
> The operator currently always uses a busy-time based TPR calculation which is
> very flexible and allows for scaling up / down but is susceptible to
> overestimation due to the broken metrics.
> h3. Suggested new behaviour
> Instead of using the busy-time based TPR we detect when TPR is overestimated
> (busy-time too low) and switch to observed TPR.
> To do this, whenever we there is lag for a source (during catchup, or
> lag-buildup) we measure both busy-time and observed TPR.
> If the avg busy-time based TPR is off by a configured amount we switch to
> observed TPR for this source during metric evaluation.
> *Why not use observed TPR all the time?*
> Observed TPR can only be measured when we are catching up (during
> stabilization) or when cannot keep up. This makes it harder to scale down or
> to detect changes in source throughput over time (before lag starts to build
> up). Instead of using observed TPR we switch to it only when we detect a
> problem with the busy-time (this is a rare case overall), to hopefully get
> the best of both worlds.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)