Hi Aravind,

Thanks for your detailed explanation.

The mechanism of disruption aware slot selection makes sense to me. But I'm
curious what the task failure rate was before you introduced this solution,
given the substantial (20x) decrease in task failures. The task failure rate
in our production environment is already very low, especially for Flink batch
jobs without replica support. Meanwhile, are there any configuration changes
for disruption aware slot selection? I did not see any configuration changes
in this proposal.

Regards,
Nicholas Jiang

On 2025/05/13 21:46:46 Aravind Patnam wrote:
> Hi Nicholas,
> 
> Thanks for the feedback and questions! Let me answer your questions below.
> 
> > 1. Why is the lack of Flink replication support the motivation for
> disruption aware slot selection? Do you mean that disruption aware slot
> selection helps to reduce recomputation costs for Flink without replication
> support?
> Yes, this would significantly reduce recomputation costs. This is
> especially true in an environment like ours internally, where we experience
> constant interruptions throughout the day to a large set of workers. By
> proactively using only workers that won't get interrupted, we can minimize
> this recomputation cost and reduce the task failures.
> 
> > 2. Can you provide a complete definition of the /updateInterruptionNotice
> interface? Meanwhile, could you also provide the definition of
> corresponding CLI interface?
> Yes, sure - I have updated the CIP google doc to include the interface
> <https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?pli=1&tab=t.0#heading=h.96cl46gz9o8l>.
> In summary, it would be a PUT API that takes a list of
> UpdateInterruptionNoticeRequests.
> Each request object's fields would simply be the worker hostname along with
> the next known interruption timestamp in milliseconds. Every time this API
> is called, it will automatically reset the workers not included in the
> request back to no nextKnownDisruption.
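
A minimal sketch of the replace-all semantics described above, to make the reset behaviour concrete. The field names (worker_hostname, next_interruption_ms) and the helper apply_interruption_notices are hypothetical and not taken from the CIP; the real endpoint may be shaped differently.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class UpdateInterruptionNoticeRequest:
    # Hypothetical field names; the thread only says "worker hostname" and
    # "next known interruption timestamp in milliseconds".
    worker_hostname: str
    next_interruption_ms: int


def apply_interruption_notices(
    known_workers: Iterable[str],
    requests: List[UpdateInterruptionNoticeRequest],
) -> Dict[str, Optional[int]]:
    """Return the new interruption state for every known worker.

    Workers missing from `requests` are reset to None (no known disruption).
    Assumption: requests naming unknown hosts are ignored.
    """
    notices = {r.worker_hostname: r.next_interruption_ms for r in requests}
    return {host: notices.get(host) for host in known_workers}


# Only host-a is mentioned, so host-b and host-c are reset to None.
state = apply_interruption_notices(
    ["host-a", "host-b", "host-c"],
    [UpdateInterruptionNoticeRequest("host-a", 1747171200000)],
)
print(state)
```

The point of the sketch is that each call carries the full picture: a caller that wants to keep an existing notice has to resend it.
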
> 
> > 3. How is the performance of disruption aware slot selection? Which
> scenario could users use disruption aware slot selection?
> Internally, we have enabled this feature and seen a large reduction in
> task failures. You can see the results here
> <https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?pli=1&tab=t.0#bookmark=id.asvbrvv0c9q6>:
> we see roughly a *20x decrease in Spark task failures related to
> interruptions*. This is because we have a high rate of interruptions
> internally. Other users with similar environments would benefit greatly
> from this feature, as they can enable it to prioritize workers that will
> not get interrupted.
> 
> > 4. How could the cluster administrator determine
> workersWithLateInterruptions and workersWithEarlyInterruptions? BTW, how
> does the administrator evaluate the threshold percentile based on the range
> of interruption timestamps?
> This sort of depends on how the interruptions are scheduled and how much
> advance notice is present, along with the application characteristics in
> the cluster. We plan on distinguishing the workers as follows in the code:
> 
> 
>    - prioritizedWorkers = workersWithNoInterruptions +
>      workersWithLateInterruptions
>    - deprioritizedWorkers = workersWithEarlyInterruptions
> 
> 
> The calculation, with workers sorted by interruption timestamp, is:
> 
>    - workersWithLateInterruptions = the latest percentageThreshold share of
>      sorted(totalWorkers)
>    - workersWithEarlyInterruptions = totalWorkers -
>      workersWithLateInterruptions
> 
> 
> This means that the higher the percentage threshold, the larger
> workersWithLateInterruptions becomes, and so the larger the number of
> prioritizedWorkers.
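
A minimal sketch of the calculation above, written only to check the arithmetic against the examples in this thread. The function name, the integer-percent parameter, and rounding down when the percentage does not divide evenly are all assumptions, not part of the proposal.

```python
from typing import Dict, List, Optional, Tuple


def partition_workers(
    next_interruption: Dict[str, Optional[int]],
    threshold_percent: int,
) -> Tuple[List[str], List[str]]:
    """Return (prioritizedWorkers, deprioritizedWorkers) as worker names."""
    no_interruption = [w for w, t in next_interruption.items() if t is None]
    # Workers with a known interruption, earliest timestamp first.
    scheduled = sorted(
        (w for w, t in next_interruption.items() if t is not None),
        key=lambda w: next_interruption[w],
    )
    # Assumption: integer arithmetic, rounding down.
    late_count = len(scheduled) * threshold_percent // 100
    split = len(scheduled) - late_count
    early, late = scheduled[:split], scheduled[split:]
    return no_interruption + late, early


# Ten workers w1..w10 with interruption timestamps t5..t14.
workers = {f"w{i}": 4 + i for i in range(1, 11)}
prioritized_50, deprioritized_50 = partition_workers(workers, 50)
prioritized_90, deprioritized_90 = partition_workers(workers, 90)
print(deprioritized_50)  # w1..w5
print(deprioritized_90)  # w1 only
```
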
> 
> For example, if you have a larger advance notice (such as 1 week or so), it
> makes sense to set this percentage threshold higher (e.g. 90%). If you have
> a smaller advance notice window, it makes sense to set it to something lower
> (e.g. 50%).
> 
> Consider this example where there are 10 workers with interruption
> timestamps t5-t14, and current time t0:
> 
> worker        w1   w2   w3   w4   w5   w6   w7   w8   w9   w10
> interruption  t5   t6   t7   t8   t9   t10  t11  t12  t13  t14
> 
> If we set percentageThreshold to 50%, then the following would occur:
> 
>    - w1 - w5 would be considered workersWithEarlyInterruptions
>    - w6 - w10 would be considered workersWithLateInterruptions
> 
> Given that the current time is only t0 and is far away from t5, we would
> probably be wasting workers w1 - w5, since they still have a long time
> before they are disrupted. Hence, in this case it would make more sense to
> set a higher percentage threshold such as 90%. Then it would be like this:
> 
>    - w1 would be considered workersWithEarlyInterruptions
>    - w2 - w10 would be considered workersWithLateInterruptions
> 
> This is better for the cluster given a larger advance notice.
> 
> However, if there is a smaller advance notice, consider this example:
> 
> worker        w1   w2   w3   w4   w5   w6   w7   w8   w9   w10
> interruption  t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
> 
> If we set percentageThreshold to 90%, then the following would occur:
> 
>    - w1 would be considered workersWithEarlyInterruptions
>    - w2 - w10 would be considered workersWithLateInterruptions
> 
> Given that the current time is only t0 and is close to t1 and t2,
> medium/long-running jobs would probably incur failures as w2 - w10 start
> going down. Hence in this case, it might be better to set a lower
> percentage, such as 50%. Then it would be like this:
> 
>    - w1 - w5 would be considered workersWithEarlyInterruptions
>    - w6 - w10 would be considered workersWithLateInterruptions
> 
> This is better for the cluster given a smaller advance notice.
> 
> Overall, the cluster administrator should take into account the
> interruption advance notice time, along with the average/median job
> runtime in the cluster, before setting this value.
> We can add more documentation on how to set this value once we start
> implementing it.
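
One way to read the guidance above is to compare the earliest interruption among the prioritized workers with a typical job runtime. The sketch below does that for the two examples in this thread. The helper name, the round-down rule, and the median job runtime of 4 time units are invented for illustration only and do not come from the proposal.

```python
from typing import Dict, Optional


def prioritized_headroom(
    interruption_times: Dict[str, int],
    threshold_percent: int,
    now: int,
) -> Optional[int]:
    """Time from `now` until the first late-interruption worker goes down.

    Returns None when the threshold leaves no worker in the late group.
    Assumption: the late group size is rounded down.
    """
    ordered = sorted(interruption_times.values())
    late_count = len(ordered) * threshold_percent // 100
    if late_count == 0:
        return None
    return ordered[len(ordered) - late_count] - now


long_notice = {f"w{i}": 4 + i for i in range(1, 11)}   # t5..t14
short_notice = {f"w{i}": i for i in range(1, 11)}      # t1..t10
MEDIAN_JOB_RUNTIME = 4  # hypothetical, in the same time units

for label, times in (("long notice", long_notice), ("short notice", short_notice)):
    for percent in (50, 90):
        headroom = prioritized_headroom(times, percent, now=0)
        fits = headroom is not None and headroom >= MEDIAN_JOB_RUNTIME
        print(f"{label}, {percent}%: headroom={headroom}, median job fits={fits}")
```

Under these made-up numbers, 90% leaves enough headroom with the long notice window but not with the short one, which matches the reasoning in the examples.
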
> 
> Hope this answers all your questions!
> 
> Thanks,
> Aravind
> 
> 
> On Tue, May 13, 2025 at 9:40 AM Nicholas Jiang <nicholasji...@apache.org>
> wrote:
> 
> > Hi Aravind,
> >
> > Thanks for driving the proposal of interruption aware slot selection. I
> > have some comments for this proposal:
> >
> > 1. Why is the lack of Flink replication support the motivation for
> > disruption aware slot selection? Do you mean that disruption aware slot
> > selection helps to reduce recomputation costs for Flink without replication
> > support?
> >
> > 2. Can you provide a complete definition of the /updateInterruptionNotice
> > interface? Meanwhile, could you also provide the definition of
> > corresponding CLI interface?
> >
> > 3. How is the performance of disruption aware slot selection? Which
> > scenario could users use disruption aware slot selection?
> >
> > 4. How could the cluster administrator determine
> > workersWithLateInterruptions and workersWithEarlyInterruptions? BTW, how
> > does the administrator evaluate the threshold percentile based on the range
> > of interruption timestamps?
> >
> > Regards,
> > Nicholas Jiang
> >
> > On 2025/05/02 07:40:15 Aravind Patnam wrote:
> > > Hi Celeborn community,
> > >
> > > I have written up CIP-17: Interruption Aware Slot Selection
> > > <https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?usp=sharing>.
> > > Please review and let me know if there are any comments or questions.
> > >
> > > This is a feature we have introduced internally, given our heavy volume
> > > of interruptions. We have seen a substantial decrease in task failures
> > > in both Flink and Spark jobs, and think the community would also benefit
> > > from this :)
> > >
> > > Looking forward to getting feedback from the community.
> > >
> > > Thanks,
> > > Aravind
> > >
> > >  CIP 17: Interruption Aware Slot Selection
> > > <https://drive.google.com/open?id=16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw>
> > >
> >
> 
> 
> -- 
> Aravind K. Patnam
> 
