[jira] [Commented] (FLINK-36419) Consider scaling in case of no input records on a source operator with long running job

Sergey Kononov (Jira) Wed, 02 Oct 2024 05:43:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-36419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886367#comment-17886367
 ]


Sergey Kononov commented on FLINK-36419:
----------------------------------------

[~gyfora] , could you look at this issue?

> Consider scaling in case of no input records on a source operator with long 
> running job 
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-36419
>                 URL: https://issues.apache.org/jira/browse/FLINK-36419
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.9.0
>            Reporter: Sergey Kononov
>            Priority: Major
>
> There is a case when a Hudi Table Source introduces two connected vertices: 
> Split Monitor and Split Reader.  Split Monitor is considered by Flink 
> Kubernetes Operator as a factual source because it has no input vertices. 
> Split Monitor stops reading and assigning splits long before Split Reader 
> will read up all the data from Hudi table and job will be running with low 
> parallelism much longer than it could with rescaling. The reason is that now 
> Autoscaler (in 
> [ScalingMetricEvaluator|https://github.com/apache/flink-kubernetes-operator/blob/release-1.9/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingMetricEvaluator.java])
>  evaluates target data rates starting from the sources and if a source stops 
> reading the data all target data rates downstream become zero and 
> corresponding vertices are kept at lowest parallelism configured.
> Refactoring Hudi source according to FLIP-27 and FLIP-33 could have helped 
> with the issue. Meanwhile a rather simple and general fix can be made to 
> Autoscaler in order to support this case.
> The solution is to evaluate target data rate starting from the last 
> downstream vertex or vertices in the job graph that still feeds the data. It 
> would also resolve the general case when some vertices continue to process 
> data when sources already stopped. I tested this solution on a k8s cluster 
> and willing to contribute it if assigned.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36419) Consider scaling in case of no input records on a source operator with long running job

Reply via email to