gyfora opened a new pull request, #686:
URL: https://github.com/apache/flink-kubernetes-operator/pull/686

   ## What is the purpose of the change
   
   The aim of this PR is to address cases where the operator incorrectly reports low busy time (high idleness) for sources that in fact cannot keep up due to slow readers/fetchers. As the metrics cannot generally be fixed on the Flink / connector side, we have to detect this and handle it when collecting the metrics.
   
   The main symptom of this problem is an overestimated true processing rate: scaling is not triggered even when lag is building up, because the autoscaler thinks it will be able to keep up.
   
   To tackle this we differentiate between two methods of TPR (true processing rate) measurement (see the sketch after this list):
    1. **Busy-time based TPR** (the current approach in the autoscaler): computed from incoming records and busy time
    2. **Observed TPR**: computed from incoming records and back pressure, measurable only when we can assume full processing throughput (i.e. during catch-up)
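
   As a rough illustration of the difference, here is a minimal Java sketch based on the description above; the method names, parameters and exact formulas are assumptions for illustration, not the operator's actual code:

   ```java
   /**
    * Busy-time based TPR: the observed record rate scaled up by the reported
    * busy ratio. If the connector under-reports busy time, the ratio is too
    * small and the TPR is over-estimated.
    */
   static double busyTimeBasedTpr(double recordsInPerSec, double busyTimeMsPerSec) {
       double busyRatio = busyTimeMsPerSec / 1000.0;
       return recordsInPerSec / busyRatio;
   }

   /**
    * Observed TPR: only meaningful while the source runs at full throughput
    * (catching up on lag). The observed rate is corrected for the share of
    * time the source spent back pressured by downstream operators.
    */
   static double observedTpr(double recordsInPerSec, double backPressureRatio) {
       return recordsInPerSec / (1.0 - backPressureRatio);
   }
   ```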
   
   ### Current behaviour
   
   The operator currently always uses the busy-time based TPR calculation, which is very flexible and allows scaling both up and down, but is susceptible to overestimation when the busy-time metric is broken.
   
   ### Suggested new behaviour
   
   Instead of always using the busy-time based TPR, we detect when it is overestimated (busy time too low) and switch to the observed TPR.

   To do this, whenever there is lag for a source (during catch-up or lag build-up) we measure both the busy-time based and the observed TPR.

   If the average busy-time based TPR is off by a configured amount, we switch to the observed TPR for this source during metric evaluation (see the sketch below).
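
   A minimal sketch of the switching decision; the threshold and method names are illustrative, not the actual configuration keys or implementation:

   ```java
   /**
    * Pick the TPR used during evaluation: if the busy-time based estimate
    * exceeds the observed one by more than the configured factor, the busy-time
    * metric is considered broken for this source and the observed TPR is used.
    */
   static double effectiveTpr(
           double avgBusyTimeBasedTpr, double avgObservedTpr, double switchThreshold) {
       if (avgBusyTimeBasedTpr > avgObservedTpr * switchThreshold) {
           return avgObservedTpr;
       }
       return avgBusyTimeBasedTpr;
   }
   ```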
   
   **Why not use observed TPR all the time?**
   Observed TPR can only be measured when we are catching up (during stabilization) or when we cannot keep up. This makes it harder to scale down or to detect changes in source throughput over time (before lag starts to build up). Instead of using the observed TPR everywhere, we switch to it only when we detect a problem with the busy time (a rare case overall), to hopefully get the best of both worlds.
   
   ### Other related fixes
   
   #### Fix metric name querying for sources
   
   Previously, metric names were only queried on topology change. This is not sufficient for sources, where metrics (e.g. per-partition metrics) are dynamically added and removed. Source metric names are now always queried.
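
   The intent, as a sketch with hypothetical helper names (not the operator's actual code):

   ```java
   import java.util.Map;
   import java.util.function.Supplier;

   /**
    * Source vertices get their metric names re-queried on every collection,
    * because partition/split metrics appear and disappear at runtime; other
    * vertices can keep the names cached at topology discovery.
    */
   static Map<String, String> metricNamesFor(
           String vertexId,
           boolean isSource,
           Map<String, Map<String, String>> cache,
           Supplier<Map<String, String>> queryFromRestApi) {
       if (isSource || !cache.containsKey(vertexId)) {
           cache.put(vertexId, queryFromRestApi.get());
       }
       return cache.get(vertexId);
   }
   ```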
   
   #### Fix infinity handling in average
   
   Previously, the averaging logic returned infinity if any value in the array was infinite. This caused invalid scale-downs even when new records were in fact coming in. The logic has been fixed to return infinity only when all values are infinite.
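
   A minimal sketch of the fixed averaging behaviour (illustrative only, not the actual implementation):

   ```java
   /** Ignore infinite samples; return infinity only if every sample is infinite. */
   static double average(double[] values) {
       double sum = 0;
       int finiteCount = 0;
       for (double v : values) {
           if (Double.isFinite(v)) {
               sum += v;
               finiteCount++;
           }
       }
       return finiteCount == 0 ? Double.POSITIVE_INFINITY : sum / finiteCount;
   }
   ```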
   
   ## Brief change log
   
     - *Fix source metric name querying (always refresh for sources to detect new partition metrics)*
     - *Collect metrics also during stabilization (but do not evaluate them) to allow measuring the observed TPR*
     - *Add new TPR measurement logic for sources*
     - *Change the TPR evaluation logic to switch between the observed and the busy-time based true processing rate*
     - *Fix infinity handling for averages (return infinity only if all values are infinite, otherwise ignore infinite values)*
   
   ## Verifying this change
   
   Unit tests + manual verification
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., any changes to the `CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no

