Re: [DISCUSS] A design proposal to fix the wrong dynamic replacement of partitioner from FORWARD to REBALANCE for AutoScaler and AdaptiveScheduler

Yuepeng Pan Wed, 14 Jan 2026 20:05:49 -0800

Thank you very much for your response and proposal.

Given the scope of this email, and since I am not familiar with
scheduling strategies in batch processing,
I can only leave a few preliminary comments on whether your proposal
could address the issues observed when streaming jobs enable both the
Adaptive Scheduler and the AutoScaler.


As far as I understand, the issue described in FLINK-33123[1] occurs during
the execution phase,
after the JobGraph has already been generated. This phase comes later than
the stage at which your proposed fix takes effect.
As a result, the following risks may still remain:

- Jobs that enable both the Adaptive Scheduler and the AutoScaler may still
face the issue
where upstream and downstream JobVertex instances are connected by FORWARD
edges but have inconsistent parallelism.

- The proposed fix seems to apply only to the dynamic graph generation mode
of adaptive batch scheduling,
and therefore may not be effective for the non-dynamic graph generation
logic involved in FLINK-33123.
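
To make the first risk concrete, here is a minimal sketch in plain Java.
It is not Flink code: the Edge class, its fields, and
findInconsistentForwardEdges are hypothetical names used only for
illustration. It models edges between vertices and flags any FORWARD edge
whose upstream and downstream parallelism have diverged, for example after
a rescale.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a job graph check; none of this is a Flink API. */
final class ForwardEdgeCheck {

    /** Hypothetical edge model: two vertices, their parallelism, and a partitioner name. */
    static final class Edge {
        final String upstream;
        final int upstreamParallelism;
        final String downstream;
        final int downstreamParallelism;
        final String partitioner; // e.g. "FORWARD", "REBALANCE", "RESCALE"

        Edge(String upstream, int upstreamParallelism,
             String downstream, int downstreamParallelism,
             String partitioner) {
            this.upstream = upstream;
            this.upstreamParallelism = upstreamParallelism;
            this.downstream = downstream;
            this.downstreamParallelism = downstreamParallelism;
            this.partitioner = partitioner;
        }
    }

    /** Returns every FORWARD edge whose two endpoints no longer share the same parallelism. */
    static List<Edge> findInconsistentForwardEdges(List<Edge> edges) {
        List<Edge> broken = new ArrayList<>();
        for (Edge e : edges) {
            if ("FORWARD".equals(e.partitioner)
                    && e.upstreamParallelism != e.downstreamParallelism) {
                broken.add(e);
            }
        }
        return broken;
    }
}
```

For an edge A(2) -> B(3) marked FORWARD, the method returns that edge.
FORWARD edges with equal parallelism, and edges using other partitioners,
are not reported.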

In addition, the discussion thread and external references in FLINK-33123
already explain
why ForwardPartitioner → RebalancePartitioner is used instead of
ForwardPartitioner → RescalePartitioner
in streaming jobs that enable both the Adaptive Scheduler and the AutoScaler.
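
For readers less familiar with the three partitioners, the sketch below
shows only how their connectivity differs. It is a simplified model in
plain Java, not Flink's implementation: the class and method names are
hypothetical, and the slice arithmetic for RESCALE is an assumption chosen
for illustration. The reasons for preferring one replacement over the other
are in the ticket, not in this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified connectivity model; names and slice arithmetic are illustrative assumptions. */
final class PartitionerFanOut {

    /** FORWARD: one-to-one, so it is only defined when both sides have equal parallelism. */
    static List<Integer> forwardTargets(int upstreamIndex, int upstreamParallelism,
                                        int downstreamParallelism) {
        if (upstreamParallelism != downstreamParallelism) {
            throw new IllegalStateException("FORWARD requires equal parallelism");
        }
        return List.of(upstreamIndex);
    }

    /** REBALANCE: every upstream subtask can reach every downstream subtask. */
    static List<Integer> rebalanceTargets(int downstreamParallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < downstreamParallelism; i++) {
            targets.add(i);
        }
        return targets;
    }

    /** RESCALE: each upstream subtask reaches only a contiguous slice of downstream subtasks. */
    static List<Integer> rescaleTargets(int upstreamIndex, int upstreamParallelism,
                                        int downstreamParallelism) {
        if (downstreamParallelism < upstreamParallelism) {
            // Several upstream subtasks share a single downstream subtask.
            return List.of(upstreamIndex * downstreamParallelism / upstreamParallelism);
        }
        int start = upstreamIndex * downstreamParallelism / upstreamParallelism;
        int end = (upstreamIndex + 1) * downstreamParallelism / upstreamParallelism;
        List<Integer> targets = new ArrayList<>();
        for (int i = start; i < end; i++) {
            targets.add(i);
        }
        return targets;
    }
}
```

In this model, with upstream parallelism 2 and downstream parallelism 4,
upstream subtask 1 reaches downstream subtasks 2 and 3 under RESCALE but
all four under REBALANCE, while FORWARD is rejected outright.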

Given this, I think it might be more appropriate to address batch mode
and streaming mode as two separate cases, especially since the two
modes are currently implemented by different schedulers.
Of course, if a single, elegant solution can be shared between both
bug cases, that would be ideal.
However, this would likely come with costs that are difficult to
estimate at the moment,
such as the effort needed for investigation,
proof-of-concept work, and subsequent reviews.
Achieving all of this in a short time frame may not be easy.

Please correct me if I'm wrong.
Thank you very much.

BTW, if you are confident that the issue you mentioned is still not fixed
in the latest version of the Flink repo[2]
and that no developers are currently working on it,
you are more than welcome to report it in Jira[3] and try to address it.


[1] https://issues.apache.org/jira/browse/FLINK-33123
[2] https://github.com/apache/flink
[3] https://issues.apache.org/jira/projects/FLINK/issues

Best regards,
Yuepeng Pan


Bonnie Arogyam Varghese via dev <[email protected]> wrote on Thu, Jan 15,
2026 at 00:56:

> Hi Yuepeng Pan and other community members,
> I have also seen a similar issue, but in batch mode. The eventual fix
> could be leveraged for both streaming and batch modes.
>
> The issue I observed in batch mode can be reproduced in either of the
> following two ways:
> 1. Disable operator chaining completely
> 2. "pipeline.operator-chaining.chain-operators-with-different-max-parallelism": "false" //
> Default is true
>
> I have captured my findings and experimented with a fix for batch mode:
>
> https://docs.google.com/document/d/1TTj2ddlQTfDgtGb0ISmiKWt6R9U4RxJ59o6bULC1YtI/edit?usp=sharing
>
>
>
> On Tue, Jan 13, 2026 at 7:47 AM Yuepeng Pan <[email protected]>
> wrote:
>
> > Hi community,
> >
> > I would like to start a discussion around the issue described in
> > **FLINK-33123[1]**.
> >
> > This issue can mainly be broken down into two parts:
> > a).
> > Assuming that initially two upstream and downstream JobVertices connected
> > by a FORWARD edge have the same parallelism,
> > due to a rescale operation their parallelism becomes different.
> > In this case, the current strategy may produce incorrect results when
> > rebuilding the upstream–downstream network partition connections.
> > b).
> > Assuming that the parallelism of two upstream and downstream JobVertices
> > is different,
> > but due to a rescale operation their parallelism needs to be adjusted to
> > be the same.
> > In this scenario, it is not possible to determine the partition type
> > after the rescale.
> >
> > So, I'd like to share a design proposal[2] that attempts to address the
> > problem described in the ticket[1].
> >
> > Thanks in advance for your time and feedback.
> > Looking forward to the discussion!
> >
> >
> > [1]https://issues.apache.org/jira/browse/FLINK-33123
> > [2]https://docs.google.com/document/d/1e_6o4bdXcKtFL3xYxKeyKnRjR8ffsw6Z8frp3tp7u-M/edit?usp=sharing
> >
> > Best regards,
> > Yuepeng Pan
> >
>
