Hi Lei,

Thank you for the thorough review and the insightful feedback! Your questions
on failover and scaling are very valuable. Here are my responses:


1. Relationship between RpcOperator and JobGraph recovery semantics
This is a great observation. RpcOperator achieves failover isolation through
an independent Pipelined Region. Under the **DefaultScheduler**, this works as
intended: a subtask failure only restarts the affected Region, and the
RpcOperator Region remains unaffected (and vice versa).
However, the **AdaptiveScheduler** currently only supports full-graph
restarts; it cannot restart individual Regions independently. This means that
during both failover and rescale, the data plane and the service plane cannot
be isolated: the entire ExecutionGraph (including RpcOperator vertices) is
torn down and recreated. This is a limitation of the AdaptiveScheduler.
Achieving true isolation under AdaptiveScheduler would require enhancing it
with region-level restart capabilities, which is orthogonal to the RpcOperator
design and should be pursued independently.
I will update the FLIP to clarify the scope of failover isolation with respect
to different schedulers.
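
To illustrate the difference in restart scope, here is a toy model (not
Flink code). The class name, method names, and vertex/region labels are all
made up for this sketch, and it deliberately ignores any cross-region
dependencies that a real scheduler would have to consider.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: all names here are hypothetical, not Flink APIs. */
final class RestartScope {

    /** Region-level restart: only vertices in the failed vertex's region. */
    static Set<String> regionRestart(Map<String, String> regionOf, String failed) {
        String region = regionOf.get(failed);
        Set<String> toRestart = new TreeSet<>();
        for (Map.Entry<String, String> e : regionOf.entrySet()) {
            if (e.getValue().equals(region)) {
                toRestart.add(e.getKey());
            }
        }
        return toRestart;
    }

    /** Full-graph restart: every vertex, regardless of its region. */
    static Set<String> fullGraphRestart(Map<String, String> regionOf) {
        return new TreeSet<>(regionOf.keySet());
    }
}
```

With a mapping such as source -> "data", map -> "data", rpc-operator ->
"service", a failure of "map" touches only the two data-plane vertices in the
region-level case, while the full-graph case also includes rpc-operator.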


2. Client support for future flexible scaling
Great point, and I fully agree. Supporting push-based endpoint updates is
essential for the client to discover instance additions and removals in a
timely manner, especially for future no-restart dynamic scaling. I will update
the FLIP to incorporate a push-based endpoint update mechanism in the service
discovery design.
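
As a rough idea of what "push-based" could mean here, below is a minimal
sketch under my own assumptions. EndpointRegistry and its methods are
hypothetical names, not part of the current FLIP or of ROSClient; the actual
design may look quite different.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical sketch: pushes the full endpoint set to subscribers on change. */
final class EndpointRegistry {
    private final Set<String> endpoints = ConcurrentHashMap.newKeySet();
    private final List<Consumer<Set<String>>> listeners = new CopyOnWriteArrayList<>();

    /** A new subscriber immediately receives the current snapshot. */
    void subscribe(Consumer<Set<String>> listener) {
        listeners.add(listener);
        listener.accept(Set.copyOf(endpoints));
    }

    /** Called when an instance becomes ready (scale-out). */
    void register(String endpoint) {
        if (endpoints.add(endpoint)) {
            publish();
        }
    }

    /** Called when an instance is about to exit (scale-in). */
    void deregister(String endpoint) {
        if (endpoints.remove(endpoint)) {
            publish();
        }
    }

    /** Pushes an immutable snapshot to every subscriber. */
    private void publish() {
        Set<String> snapshot = Set.copyOf(endpoints);
        for (Consumer<Set<String>> listener : listeners) {
            listener.accept(snapshot);
        }
    }
}
```

A client would subscribe once and replace its routing table whenever a new
snapshot arrives, so it does not have to poll for membership changes.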


Thank you for the feedback and suggestions. Please feel free to follow up if
you have any further questions!


Best,
Yi



At 2026-06-03 14:20:00, "Lei Yang" <[email protected]> wrote:
>Hi Yi,
>
>Thank you for the great work on FLIP-582. As GPU resources become
>increasingly valuable, RpcOperator can significantly improve resource
>utilization and strengthen Flink’s competitiveness for AI workloads. I am
>particularly interested in the failover and flexible scaling aspects of this
>FLIP, and I have two questions that I hope you can help clarify.
>
>1. Relationship between RpcOperator and JobGraph recovery semantics
>
>The current FLIP models RpcOperator as an independent JobVertex in
>the JobGraph. This means that although it can be isolated from the data
>plane in terms of resources and regions, it still belongs to the same job
>graph in terms of scheduling, recovery, and rescaling semantics.
>
>For example, in AdaptiveScheduler, rescaling or failure recovery may enter
>the Restarting -> CreatingExecutionGraph path and rebuild the ExecutionGraph.
>An external autoscaler can also adjust the parallelism requirements of job
>vertices through the JobResourceRequirements REST API; with AdaptiveScheduler,
>this may further trigger rescale / restart. In such cases, RpcOperator may
>also be restarted together with the data-plane tasks, since it is part of the
>same JobGraph / ExecutionGraph.
>
>This seems to leave some gap with the goal of RpcOperator being an
>independent service that is not affected by the data plane. Therefore, I
>would like to confirm whether the FLIP plans to introduce a more fine-grained
>recovery and restart mechanism, so that RpcOperator can restart, fail over,
>or rescale independently from data processing vertices.
>
>2. Client support for future flexible scaling
>
>The FLIP mentions that RpcOperator instances are independent service
>instances, and that an instance going online or offline should not affect
>other instances or the data processing flow. I understand that the current
>FLIP may not need to fully support dynamic scaling in the first phase.
>However, if flexible scaling of RpcOperator is expected in the future, the
>client may need to be aware of changes in RpcOperator parallelism and the
>instance list.
>
>For example, during a future scale-out, the system may start a new
>RpcOperator instance without restarting existing ones. After the new instance
>becomes ready, the client needs to discover it in time and include it in
>request routing. During scale-in, the client also needs to detect instance
>removal in time and avoid sending new requests to instances that are about to
>exit.
>
>Therefore, I would like to confirm whether the current ROSClient design can
>support push-based discovery of RpcOperator instance additions and removals
>for future no-restart dynamic scaling.
>
>Best,
>Lei
>
>On Wed, May 27, 2026 at 14:12, Yi Zhang <[email protected]> wrote:
>
>> Hi everyone,
>>
>>
>>
>> I would like to start a discussion on FLIP-582: Support RpcOperator
>> Service [1].
>>
>>
>> AI-oriented workloads like multimodal data processing and model inference
>> have been growing rapidly in recent years. These workloads are
>> characterized by expensive resources (GPUs) and high initialization costs
>> (seconds to minutes for model loading). In today's Flink, embedding them in
>> the data plane couples their parallelism and failover with surrounding
>> operators; deploying them as external services disconnects their lifecycle
>> from the job and doubles operational overhead.
>>
>>
>> This FLIP introduces RpcOperator Service, a framework-level primitive that
>> runs user-defined compute as RPC services in an independent Pipelined
>> Region within the Flink job. Because the service is isolated at the
>> scheduling level, it can achieve fault isolation, independent scaling, and
>> dedicated resource allocation. As a native Flink primitive, it also lays
>> the foundation for automatic flow control, flexible load balancing, and
>> coordinated auto-scaling, all without introducing external infrastructure
>> or additional operational burden.
>>
>>
>>
>>
>> Looking forward to your feedback and suggestions!
>>
>>
>>
>>
>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-582%3A+Support+RpcOperator+Service
>>
>>
>>
>>
>>
>> Best Regards,
>> Yi Zhang
