Re: [DISCUSS] CIP-17 Interruption Aware Slot Selection

Nicholas Jiang Tue, 13 May 2025 09:40:55 -0700

Hi Aravind,

Thanks for driving the interruption aware slot selection proposal. I have 
some comments on it:

1. Why is the lack of Flink replication support the motivation for disruption 
aware slot selection? Do you mean that disruption aware slot selection helps to 
reduce recomputation costs for Flink without replication support?

2. Can you provide a complete definition of the /updateInterruptionNotice 
interface? Could you also provide the definition of the corresponding CLI 
interface?

3. How does disruption aware slot selection perform? In which scenarios 
could users use disruption aware slot selection?

4. How could the cluster administrator determine workersWithLateInterruptions 
and workersWithEarlyInterruptions? BTW, how does the administrator evaluate the 
threshold percentile based on the range of interruption timestamps?

Regards,
Nicholas Jiang

On 2025/05/02 07:40:15 Aravind Patnam wrote:
> Hi Celeborn community,
> 
> I have written up CIP-17: Interruption Aware Slot Selection
> <https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?usp=sharing>.
> Please review and let me know if there are any comments or questions.
> 
> This is a feature we have introduced internally, given our heavy volume of
> interruptions. We have seen a substantial decrease in task failures in both
> Flink and Spark jobs, and think the community would also benefit from this
> :)
> 
> Looking forward to getting feedback from the community.
> 
> Thanks,
> Aravind
> 
>  CIP 17: Interruption Aware Slot Selection
> <https://drive.google.com/open?id=16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw>
>
