To add on to my answer for the 4th question -- internally we have this value to 50% given our interruption notice period is 24h.
Thanks, Aravind On Tue, May 13, 2025 at 2:46 PM Aravind Patnam <akpatna...@gmail.com> wrote: > Hi Nicholas, > > Thanks for the feedback and questions! Let me answer your questions below. > > > 1. Why is the lack of Flink replication support the motivation for > disruption aware slot selection? Do you mean that disruption aware slot > selection helps to reduce recompution costs for Flink without replication > support? > Yes, this would significantly reduce recomputation costs. This is > especially in an environment like ours internally, where we experience > constant interruptions throughout the day to a large set of workers. By > proactively only using workers that won't get interrupted, we can minimize > this recomputation cost and reduce the task failures. > > > 2. Can you provide a complete definition of the > /updateInterruptionNotice interface? Meanwhile, could you also provide the > definition of corresponding CLI interface? > Yes, sure - I have updated the CIP google doc to now include the interface > <https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?pli=1&tab=t.0#heading=h.96cl46gz9o8l> > now. In summary, it would be a PUT api, which would take in a List of > *UpdateInterruptionNoticeRequests. > *This object's fields would simply be the worker hostname along with the > next known interruption timestamp in milliseconds. Every time this API is > called, it will reset the workers not included in the request back to no > nextKnownDisruption automatically. > > > 3. How is the performance of disruption aware slot selection? Which > scenario could users use disruption aware slot selection? > Internally, we have enabled this feature and we see great improvements to > our task failures. You can see the results here > <https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?pli=1&tab=t.0#bookmark=id.asvbrvv0c9q6> > -- > we roughly see *20x decrease for Spark task failures related to > interruptions*. This is because we have a high rate of interruptions > internally. Other users with similar environments would greatly benefit > from this feature, as they can enable this to prioritize workers that will > not get interrupted. > > > 4. How could the cluster administrator determine > workersWithLateInterruptions and workersWithEarlyInterruptions? BTW, how > does the administrator evaluate the threshold percentile based on the range > of interruption timestamps? > This sort of depends on how the interruptions are scheduled and how much > advance notice is present, along with the application characteristics in > the cluster. We plan on distinguishing the workers as follows in the code: > > > - > > prioritizedWorkers = workersWithNoInterruptions + > workersWithLateInterruptions > - > > deprioritizedWorkers = workersWithEarlyInterruptions > > > The calculation is: > > - workersWithLateInterruptions = percentageThreshold * > sorted(totalWorkers) > - workersWithEarlyInterruptions = totalWorkers - > workersWithLateInterruptions > > > This means that the higher the percentage threshold, the higher the size > of workersWithLateInterruptions, which means the higher the number of > prioritizedWorkers. > > For example, if you have a larger advance notice (such as 1 week or so), > it makes sense to set this percentage threshold higher (e.g. 90%). If you > have a smaller advanced notice window, it makes sense to set it something > lower (e.g. 50%). > > Consider this example where there are 10 workers with interruption > timestamps t5-t14, and current time t0: > > w1 > > w2 > > w3 > > w4 > > w5 > > w6 > > w7 > > w8 > > w9 > > w10 > > t5 > > t6 > > t7 > > t8 > > t9 > > t10 > > t11 > > t12 > > t13 > > t14 > > If we set percentageThreshold to 50%, then the following would occur: > > - w1 - w5 would be considered workersWithEarlyInterruptions > - w6 - w10 would be considered workersWithLateInterruptions > > Given current time is only t0 and is far away from t5, this means we are > probably wasting workers w1 - w5 since they have a long time to be > disrupted anyways. Hence, in this case it would make more sense to set a > higher percentage threshold such as 90%. Then it would be like this: > > - w1 would be considered workersWithEarlyInterruptions > - w2 - w10 would be considered workersWithLateInterruptions > > This is better for the cluster given a larger advanced notice. > > However, if there is a smaller advanced notice, consider this example: > > w1 > > w2 > > w3 > > w4 > > w5 > > w6 > > w7 > > w8 > > w9 > > w10 > > t1 > > t2 > > t3 > > t4 > > t5 > > t6 > > t7 > > t8 > > t9 > > t10 > > If we set percentageThreshold to 90%, then the following would occur: > > - w1 would be considered workersWithEarlyInterruptions > - w2 - w10 would be considered workersWithLateInterruptions > > Given current time is only t0 and is close to t1 or t2, medium/long > running jobs would probably incur failures when w2-w10 go down. Hence in > this case, it might be better to set to a lower percentage, such as 50%. > Then it would be like this: > > - w1-5 would be considered workersWithEarlyInterruptions > - w6 - w10 would be considered workersWithLateInterruptions > > This is better for the cluster given a smaller advanced notice. > > Overall, the cluster administrator should take into account the > interruption advanced notice time, along with the average/median job > runtime in the cluster before setting this value. > We can add more documentation on how to set this value once we start > implementing it. > > Hope this answers all your questions! > > Thanks, > Aravind > > > On Tue, May 13, 2025 at 9:40 AM Nicholas Jiang <nicholasji...@apache.org> > wrote: > >> Hi Aravind, >> >> Thanks for driving the proposal of interruption aware slot selection. I >> have some comments for this proposal: >> >> 1. Why is the lack of Flink replication support the motivation for >> disruption aware slot selection? Do you mean that disruption aware slot >> selection helps to reduce recompution costs for Flink without replication >> support? >> >> 2. Can you provide a complete definition of the /updateInterruptionNotice >> interface? Meanwhile, could you also provide the definition of >> corresponding CLI interface? >> >> 3. How is the performance of disruption aware slot selection? Which >> scenario could users use disruption aware slot selection? >> >> 4. How could the cluster administrator determine >> workersWithLateInterruptions and workersWithEarlyInterruptions? BTW, how >> does the administrator evaluate the threshold percentile based on the range >> of interruption timestamps? >> >> Regards, >> Nicholas Jiang >> >> On 2025/05/02 07:40:15 Aravind Patnam wrote: >> > Hi Celeborn community, >> > >> > I have written up CIP-17: Interruption Aware Slot Selection >> > < >> https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?usp=sharing >> >. >> > Please review and let me know if there are any comments or questions. >> > >> > This is a feature we have introduced internally, given our heavy volume >> of >> > interruptions. We have seen substantial decrease in task failures in >> both >> > Flink and Spark jobs, and think the community would also benefit from >> this >> > :) >> > >> > Looking forward to getting feedback from the community. >> > >> > Thanks, >> > Aravind >> > >> > CIP 17: Interruption Aware Slot Selection >> > < >> https://drive.google.com/open?id=16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw >> > >> > >> > > > -- > Aravind K. Patnam > -- Aravind K. Patnam