Hi, Bonnie Arogyam Varghese. Sorry, I missed some greetings in the previous email.
Thanks for your proposal. I left some comments, PTAL. Best regards, Yuepeng Pan. Yuepeng Pan <[email protected]> 于2026年1月15日周四 12:05写道: > Thank you very much for your response and proposal. > > Since the scope of this email and the fact that I am not familiar with > scheduling strategies in batch processing. > I could only leave a few preliminary comments on whether your proposal > could address the issues observed when streaming jobs enable both the > Adaptive Scheduler and the AutoScaler. > > In my limited reading, the issue described in FLINK-33123[1] occurs during > the execution phase, > where the JobGraph has already been generated. This phase is after the > stage where your proposed fix takes effect. > As a result, the following risks may still remain: > > - Jobs(enabled the Adaptive Scheduler and the AutoScaler) may still face > the issue > where upstream and downstream JobVertex instances are connected by Forward > edges but have inconsistent parallelism. > > - The proposed fix seems to apply only to the dynamic graph generation > mode of adaptive batch scheduling, > and therefore may not be effective for the non-dynamic graph generation > logic involved in FLINK-33123. > > In addition, the discussion thread and external references in FLINK-33123 > already explain > why ForwardPartitioner → RebalancePartitioner is used instead of > ForwardPartitioner → RescalePartitioner > in the streaming jobs(enabled the Adaptive Scheduler and the AutoScaler) > > Given this, I guess that it might be more appropriate to address batch > mode > and streaming mode as two separate cases, especially since the current > scheduling modes are implemented by different schedulers. > Of course, if a single, elegant solution can be shared between both > bug-cases, that would be ideal. > However, this would likely require some costs that are difficult to > estimate at the moment, > such as having sufficient efforts to carry out investigation, > proof-of-concept work, and subsequent reviews. > Achieving all of these in a short time frame may not be easy. > > Please correct me if I'm wrong. > Thank you very much. > > BTW, If you are confident that the issue you mentioned is still not fixed > in the latest version repo[2] > of Flink and that no developers are currently working on it, > you are more than welcome to report it in Jira[3] and try to address it. > > > [1] https://issues.apache.org/jira/browse/FLINK-33123 > [2] https://github.com/apache/flink > [3] https://issues.apache.org/jira/projects/FLINK/issues > > Best regards, > Yuepeng Pan > > > Bonnie Arogyam Varghese via dev <[email protected]> 于2026年1月15日周四 > 00:56写道: > >> Hi Yuepeng Pan and other community members, >> I have also seen a similar issue however wrt to batch. The eventual fix >> could be leveraged for both streaming and batch modes. >> >> I have described the issue observed in batch mode and can be reproduced >> with the following 2 ways: >> 1. Disable operator chaining completely >> 2. "pipeline.operator >> -chaining.chain-operators-with-different-max-parallelism": "false" // >> Default is true >> >> I have captured my findings and experimented with a fix for batch mode: >> >> https://docs.google.com/document/d/1TTj2ddlQTfDgtGb0ISmiKWt6R9U4RxJ59o6bULC1YtI/edit?usp=sharing >> >> >> >> On Tue, Jan 13, 2026 at 7:47 AM Yuepeng Pan <[email protected]> >> wrote: >> >> > Hi community, >> > >> > I would like to start a discussion around the issue described in >> > **FLINK-33123[1]**. >> > >> > This issue can mainly be broken down into two parts: >> > a). >> > Assuming that initially two upstream and downstream JobVertices >> connected >> > by a FORWARD edge have the same parallelism, >> > due to a rescale operation their parallelism becomes different. >> > In this case, the current strategy may produce incorrect results when >> > rebuilding the upstream–downstream network partition connections. >> > b). >> > Assuming that the parallelism of two upstream and downstream >> JobVertices is >> > different, >> > but due to a rescale operation their parallelism needs to be adjusted >> to be >> > the same. >> > In this scenario, it is not possible to determine the partition type >> after >> > the rescale. >> > >> > So, I'd like to share a design proposal[2] that attempts to address the >> > problem described in the ticket[1]. >> > >> > Thanks in advance for your time and feedback. >> > Looking forward to the discussion! >> > >> > >> > [1]https://issues.apache.org/jira/browse/FLINK-33123 >> > [2] >> > >> > >> https://docs.google.com/document/d/1e_6o4bdXcKtFL3xYxKeyKnRjR8ffsw6Z8frp3tp7u-M/edit?usp=sharing >> > >> > Best regards, >> > Yuepeng Pan >> > >> >
