[
https://issues.apache.org/jira/browse/FLINK-33306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079285#comment-18079285
]
Trystan edited comment on FLINK-33306 at 5/8/26 12:56 AM:
----------------------------------------------------------
I know this is an old issue, but what is the best way to almost completely opt
out of this?
# Set `observed-true-processing-rate.lag-threshold` to a huge number?
# Set `observed-true-processing-rate.min-observations` to a large number (is
the observation interval ~15s as the logs would indicate?)
# Set `observed-true-processing-rate.switch-threshold` to a large number like
3?
For context, this setting causes about half of our jobs' autoscaling to flap.
It scales down, sandbags the source's "observed" TPR, then later scales it WAY
up - even after the job has fully caught up and there is no lag at all - and
repeats again and again. This scaling config makes it seem like the source is
fully incapable of changing speed, which is objectively not true in many cases
(JVM warmup, intentional backpressure via a slow-ramp rate limiter, etc). Even
just a minute or so later I can _observe_ the source pushing 3x what the
autoscaler claims is the "observed TPR".
> Use observed true processing rate when source metrics are incorrect
> -------------------------------------------------------------------
>
> Key: FLINK-33306
> URL: https://issues.apache.org/jira/browse/FLINK-33306
> Project: Flink
> Issue Type: New Feature
> Components: Kubernetes Operator
> Reporter: Gyula Fora
> Assignee: Gyula Fora
> Priority: Critical
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.7.0
>
>
> The aim is to address the cases when Flink incorrectly reports low busy time
> (high idleness) for sources that in fact cannot keep up due to the
> slowness of the reader/fetchers. As the metrics cannot generally be fixed on
> the Flink - connector side, we have to detect this and handle it when
> collecting the metrics.
> The main symptom of this problem is overestimation of the true processing
> rate and not triggering scaling even if lag is building up as the autoscaler
> thinks it will be able to keep up.
> To tackle this we differentiate two different methods of TPR measurement:
> # *Busy-time based TPR* (this is the current approach in the autoscaler):
> computed from incoming records and busy time
> # *Observed TPR*: computed from incoming records and back pressure,
> measurable only when we assume full processing throughput (i.e. during
> catch-up)
> h3. Current behaviour
> The operator currently always uses a busy-time based TPR calculation which is
> very flexible and allows for scaling up / down but is susceptible to
> overestimation due to the broken metrics.
> h3. Suggested new behaviour
> Instead of using the busy-time based TPR we detect when TPR is overestimated
> (busy-time too low) and switch to observed TPR.
> To do this, whenever there is lag for a source (during catch-up or
> lag build-up) we measure both busy-time and observed TPR.
> If the avg busy-time based TPR is off by a configured amount we switch to
> observed TPR for this source during metric evaluation.
> *Why not use observed TPR all the time?*
> Observed TPR can only be measured when we are catching up (during
> stabilization) or when we cannot keep up. This makes it harder to scale down or
> to detect changes in source throughput over time (before lag starts to build
> up). Instead of using observed TPR we switch to it only when we detect a
> problem with the busy-time (this is a rare case overall), to hopefully get
> the best of both worlds.
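
A minimal sketch of the switching logic described in the issue above, for illustration only. It is not the operator's actual code. The function names, the exact formulas (rate divided by the busy fraction, rate divided by the non-back-pressured fraction), and the reading of "off by a configured amount" as a relative margin are all assumptions; the parameters `switch_threshold` and `min_observations` only borrow their names from the config keys mentioned in the comment, and the value 0.15 is purely illustrative.

```python
from statistics import fmean


def busy_time_tpr(records_per_sec: float, busy_ms_per_sec: float) -> float:
    """Assumed busy-time TPR: the measured rate extrapolated to 100% busy time."""
    if busy_ms_per_sec <= 0:
        raise ValueError("busy time must be positive to extrapolate a rate")
    return records_per_sec / (busy_ms_per_sec / 1000.0)


def observed_tpr(records_per_sec: float, backpressured_ms_per_sec: float) -> float:
    """Assumed observed TPR: the measured rate corrected for back-pressured time.

    Only meaningful while the source is expected to run at full throughput,
    for example while it is catching up on lag.
    """
    unblocked_fraction = 1.0 - backpressured_ms_per_sec / 1000.0
    if unblocked_fraction <= 0:
        raise ValueError("source was back pressured for the whole interval")
    return records_per_sec / unblocked_fraction


def choose_tpr(busy_samples, observed_samples, switch_threshold, min_observations):
    """Return (method, tpr) for one source.

    Hypothetical rule: switch to observed TPR only when enough observations
    were collected under lag and the average busy-time TPR exceeds the average
    observed TPR by more than the relative switch_threshold.
    """
    busy_avg = fmean(busy_samples)
    if len(observed_samples) < min_observations:
        return "busy-time", busy_avg
    observed_avg = fmean(observed_samples)
    if busy_avg > observed_avg * (1.0 + switch_threshold):
        return "observed", observed_avg
    return "busy-time", busy_avg


if __name__ == "__main__":
    # A source reading 1000 rec/s while reporting only 200 ms/s of busy time
    # looks five times faster than it is when extrapolated from busy time.
    busy = [busy_time_tpr(1000, 200)] * 2
    observed = [observed_tpr(1000, 0)] * 2
    print(choose_tpr(busy, observed, switch_threshold=0.15, min_observations=2))
```

Under these assumptions, raising `switch_threshold` or `min_observations` makes the switch to observed TPR less likely, which is the effect the comment above is asking about. Whether the real configuration behaves the same way is not confirmed here.
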
--
This message was sent by Atlassian Jira
(v8.20.10#820010)