Bumping this thread kindly. Thanks!

Best,
Yuepeng Pan

At 2025-08-13 14:52:26, "Yuepeng Pan" <panyuep...@apache.org> wrote:

Hi Matthias,

Thank you very much for your comments! I have read your reply carefully and revised the FLIP accordingly. Please take a look.

Replies to your comments:

> 1. You mention a few options for when it comes to storing the data which is
> good. The FLIP doesn't point out, though, what option you're going to go
> for as part of this FLIP (as far as I can see). It would be good to only
> outline the option to go for in the FLIP and list the other options as
> rejected alternatives (with the pros and cons). I think it makes sense to
> go for option 3 (i.e. following what's done for the ExecutionGraphInfoStore
> for now). The other options can be considered as a follow-up.

Agreed. Based on this comment, I have kept option 3 in its original place and moved the other candidate options to [1].

> 2. About the terminal states of a rescaling (i.e. IGNORED, FAILED,
> COMPLETED): Can we clarify in the FLIP under what conditions the
> rescaling transitions into each of the three terminal states?

Yes, this is needed to understand the transitions into the terminal states. I have added a new subsection [2] to address it.

> 3. The section "The information to record in a rescale event" could be
> restructured in four sections (to remove redundancy):
>   a) The IDs (Rescale ID, resourceRequirementsEpochID,
>      subRescaleIdOfResourceRequirementsEpochID): What about making these
>      names easier to read (GlobalRescaleID, RescaleUUID, RescaleAttemptId)?
>   b) Per-vertex data which includes: JobVertexID, JobVertexName,
>      SlotSharingGroupId, the different parallelisms (pre-rescale, sufficient,
>      desired, post-rescale)
>   c) The SlotSharingGroup information: SlotSharingGroupId, name,
>      ResourceProfile
>   d) Other information: Timestamps of state transitions, etc. as laid out in
>      the FLIP already

That makes sense to me. Please check [3] for the latest updates to this part.

> 4. The FLIP doesn't explain how the data is passed through the
> AdaptiveScheduler states. We should be handling some kind of
> RescaleSnapshot that is passed through the different states and updated and
> its final state is stored somewhere within AdaptiveScheduler in the end, I
> guess. Can we clarify that in the FLIP?

Indeed, this was missing in the original FLIP. To address it, I have added [4], which describes how a Rescale is represented and how the rescale history can be passed along and maintained efficiently.

> 5. You mention the config parameters for the cache in the public interface
> section. But there's no mention of any caching and how that is used
> within the FLIP.

Sorry for the rough description in the previous version. This part belongs to the REST API acceleration mechanism for rescaling, and your point 6 seems reasonable to me, so I plan to add it to FLIP-487 once the design of FLIP-495 has reached consensus. Of course, if needed, I'd be happy to clarify the usage and purpose of this parameter in the current email thread.

> 6. The REST endpoint is probably better suited in FLIP-487. FLIP-495 should
> be about the actual implementation details and how the data is stored
> internally whereas FLIP-487 is about exposing the information to the
> outside through the REST API and the Flink UI. That would be a way to
> decrease the scope of FLIP-495. WDYT?

That sounds good to me. I have therefore moved all REST API-related changes to FLIP-487. BTW, to avoid repeated changes to FLIP-487, I'll start reorganizing it after FLIP-495 has been finalized.

Looking forward to your next review!

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-Aboutrescaleeventsstorage.1
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-ThemainscenarioswhereRescalestatusswitchestoterminated
[3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-Theinformationtorecordinarescaleevent
[4] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-InternalInterfaces

Best regards,
Yuepeng Pan

At 2025-08-10 23:54:37, "Matthias Pohl" <map...@apache.org> wrote:
>Hi Yuepeng,
>thanks for reminding me of this FLIP. I went over it and have a few items
>which we might need to address before we can actually finalize the vote:
>
>1. You mention a few options for when it comes to storing the data which is
>good. The FLIP doesn't point out, though, what option you're going to go
>for as part of this FLIP (as far as I can see). It would be good to only
>outline the option to go for in the FLIP and list the other options as
>rejected alternatives (with the pros and cons). I think it makes sense to
>go for option 3 (i.e. following what's done for the ExecutionGraphInfoStore
>for now). The other options can be considered as a follow-up.
>2. About the terminal states of a rescaling (i.e. IGNORED, FAILED,
>COMPLETED): Can we clarify in the FLIP under what conditions the
>rescaling transitions into each of the three terminal states?
>3. The section "The information to record in a rescale event" could be
>restructured in four sections (to remove redundancy):
>  a) The IDs (Rescale ID, resourceRequirementsEpochID,
>     subRescaleIdOfResourceRequirementsEpochID): What about making these
>     names easier to read (GlobalRescaleID, RescaleUUID, RescaleAttemptId)?
>  b) Per-vertex data which includes: JobVertexID, JobVertexName,
>     SlotSharingGroupId, the different parallelisms (pre-rescale, sufficient,
>     desired, post-rescale)
>  c) The SlotSharingGroup information: SlotSharingGroupId, name,
>     ResourceProfile
>  d) Other information: Timestamps of state transitions, etc. as laid out in
>     the FLIP already
>4. The FLIP doesn't explain how the data is passed through the
>AdaptiveScheduler states. We should be handling some kind of
>RescaleSnapshot that is passed through the different states and updated and
>its final state is stored somewhere within AdaptiveScheduler in the end, I
>guess. Can we clarify that in the FLIP?
>5. You mention the config parameters for the cache in the public interface
>section. But there's no mention of any caching and how that is used
>within the FLIP.
>6. The REST endpoint is probably better suited in FLIP-487. FLIP-495 should
>be about the actual implementation details and how the data is stored
>internally whereas FLIP-487 is about exposing the information to the
>outside through the REST API and the Flink UI. That would be a way to
>decrease the scope of FLIP-495. WDYT?
>
>Best,
>Matthias
>
>On Mon, Mar 24, 2025 at 11:37 AM Yuepeng Pan <panyuep...@apache.org> wrote:
>
>> Hi, Community,
>>
>> There haven’t been any further responses to this email over the past few
>> days. I'd like to initiate a vote on the current proposal[1] in the next
>> few days.
>> Please rest assured that I’m proceeding cautiously and not rushing the
>> process. If there are any concerns about this FLIP-495[1],
>> I will gladly pause and make the adjustments.
>>
>> Best regards,
>> Yuepeng Pan
>>
>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
>>
>> On 2024/12/17 15:18:45 Yuepeng Pan wrote:
>> > Hi community,
>> >
>> > We discussed several aspects of FLIP-487[1] 'Show history of rescales in
>> > Web UI for AdaptiveScheduler' and received a lot of valuable feedback.
>> > Based on the suggestions from the email thread[2], we plan to split the
>> > original proposal for FLIP-487[1].
>> >
>> > The current email thread and the FLIP-495[3] wiki will be used to
>> > discuss 'Support AdaptiveScheduler in recording and querying the rescale
>> > history', while FLIP-487[1] will primarily focus on display-related
>> > design content.
>> >
>> > Looking forward to any feedback and opinions on FLIP-495[3].
>> >
>> > [1] https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
>> > [2] https://lists.apache.org/thread/f4md4btkf006mxcxf66bng1kfz0rsn8c
>> > [3] https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
>> >
>> > Thank you very much.
>> >
>> > Best regards,
>> > Yuepeng Pan