Thanks Yuepeng for your response. I added my comments to the individual paragraphs below:

> Thank you very much for the reminding The proposal makes sense to me.
> Additionally, I'd like to confirm whether each rescale cycle/event
> requires a status field, such as FAILED, IGNORED, SUCCESS, PENDING, etc. If
> such state fields are not needed, how do we record that a particular
> rescale request was ignored? Or do we not care about this situation and
> only plan to record successful rescale events?
>

Ignored events (due to the AdaptiveScheduler not being in Executing state)
can still be collected by the AdaptiveScheduler in
updateJobResourceRequirements [1]. That's where the rescale history could be
populated.

Also keep in mind that it's not only the JobResourceRequirements update that
can trigger a rescale. Updating the available resources through
newResourcesAvailable [2] can also initiate a rescale. I don't see this
clearly laid out in FLIP-495 yet.

[1]
https://github.com/apache/flink/blob/a5258a015d553196d34e082e75ca4ae916addbaf/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java#L1115
[2]
https://github.com/apache/flink/blob/fc0ccf325527c3589d5cd5ae7397b22c22321cec/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/ResourceListener.java#L28

> Yes, this could happen during the reserving slots phase, and it's important
> to record this. In my limited read, this is feasible. We can collect the
> exceptions from this phase while gathering the scheduler state history, and
> record the specific information using the previously mentioned exception
> field and comments field, or use failed status as the final status of the
> rescale event/record, WDYTA?
>

I struggle to follow why we would put this in a comment and/or exception
field. Shouldn't we have the information provided in structured fields for
each event? Something like:
- pre-rescale resources
- on-rescale-trigger sufficient resources
- on-rescale-trigger desired resources
- actual resources used when rescaling finished (i.e.
  reaching Executing state)

That would give a clear summary of what was available when deciding to
rescale and what was used in the end. WDYT?

> > I feel like on disk approach (analogously to the exception
> >history) makes the most sense here. WDYT?
>
> Sorry,Matthias, IIUC,. If the storage mechanism here is similar to that of
> the exception history, then we should choose the DFS approach, such as
> HDFS. Please correct me if I’m wrong.
>

I agree that we might want to have this information stored in DFS. That way
the information would survive a JobManager failover. It would still be nice
to have all three options reflected in the FLIP, though, with their pros and
cons, so that the feature is properly documented.

As a side note: There is also FLIP-360 [3], which proposes merging the
ExecutionGraphInfoStore and the JobResultStore into a single component. That
would give us a single store for any completed job that could include the
job's result, its exception history and the rescale history. There would be
no need to rely on the HistoryServer anymore. That's out of scope for
FLIP-495, though. Still, implementing the JM-local-disk approach and working
on FLIP-360 would be another option.

[3]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore

> BTW, the subsequent FLIP content will be maintained in the wiki page, and
> the version in Google Docs will be deprecated.
>

A few other items I'd like to point out:
- Is the "Rescale Event" section still outdated, or do we have a different
  understanding of what a rescale operation is? Based on what I pointed out
  in my previous post, I would think that a rescale operation has its
  starting point in the AdaptiveScheduler's Executing state (i.e. when the
  job is running). That's how it is implemented right now. The "Rescale
  Event" section always starts from WaitingForResources, though. Do we
  disagree here?
- Rescale status: I would say that this section needs to be reworked. For
  instance:
  - Why does a rescale have the STARTING status when the scheduler is in
    StopWithSavepoint state? We shouldn't rescale if the user wants to stop
    the job, should we? Right now, rescale events are ignored in
    StopWithSavepoint.
  - The section seems to be outdated. I already pointed out in my previous
    message that the rescale operation doesn't touch WaitingForResources
    but goes from Restarting straight to CreatingExecutionGraph.
  - Can the content of this section be visualized in a proper control flow
    diagram instead? That might make the goal of this section easier to
    understand.
- The Rescale ID section:
  - You're talking about resourceRequirementsEpochID, rescale ID and rescale
    attempt here: The resourceRequirementsEpochID is used as a general UUID,
    and the attempt ID is a monotonically increasing number in the scope of
    a single resource requirements update. And the rescale ID is a globally
    monotonically increasing ID? Can you elaborate on the purpose of each of
    the IDs? Why do we need two globally-scoped IDs here?
  - Can you document what an attempt is? Is a failed attempt defined by the
    outcome of the rescale operation (i.e. the desired parallelism isn't
    reached)?

Generally, it might help to add more diagrams (especially for documenting
state machines). That might be easier to understand than plain text.

I'm looking forward to your response.

Best,
Matthias
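
PS: To make the structured-fields suggestion above a bit more concrete, here
is a rough sketch of what a rescale-history entry could look like. All type,
field and method names are made up for illustration; none of them exist in
Flink or in FLIP-495. Resources are simplified to plain slot counts, and the
status values are the ones from Yuepeng's question. Treat it as a starting
point for discussion, not as a proposed API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types exist in Flink or in FLIP-495.

/** The two rescale triggers mentioned in this thread. */
enum RescaleTrigger { REQUIREMENTS_UPDATE, NEW_RESOURCES_AVAILABLE }

/** Status values taken from the question quoted earlier in this mail. */
enum RescaleStatus { PENDING, IGNORED, FAILED, SUCCESS }

/** One rescale-history entry; resources are simplified to slot counts. */
final class RescaleEvent {
    final RescaleTrigger trigger;
    final int preRescaleSlots;    // resources in use before the rescale
    final int sufficientSlots;    // sufficient resources at trigger time
    final int desiredSlots;       // desired resources at trigger time
    private int actualSlots = -1; // unknown until the job is Executing again
    private RescaleStatus status;

    RescaleEvent(RescaleTrigger trigger, int preRescaleSlots,
                 int sufficientSlots, int desiredSlots, RescaleStatus initialStatus) {
        this.trigger = trigger;
        this.preRescaleSlots = preRescaleSlots;
        this.sufficientSlots = sufficientSlots;
        this.desiredSlots = desiredSlots;
        this.status = initialStatus;
    }

    /** Marks the event successful and records the resources actually used. */
    void complete(int actualSlots) {
        requirePending();
        this.actualSlots = actualSlots;
        this.status = RescaleStatus.SUCCESS;
    }

    /** Marks the event as failed, e.g. when reserving slots did not work out. */
    void fail() {
        requirePending();
        this.status = RescaleStatus.FAILED;
    }

    RescaleStatus status() { return status; }

    int actualSlots() { return actualSlots; }

    private void requirePending() {
        if (status != RescaleStatus.PENDING) {
            throw new IllegalStateException("Event is already terminal: " + status);
        }
    }
}

/** In-memory history; where it is persisted is a separate question. */
final class RescaleHistory {
    private final List<RescaleEvent> events = new ArrayList<>();

    /** Records a trigger; it is IGNORED if the scheduler is not in Executing. */
    RescaleEvent record(RescaleTrigger trigger, boolean schedulerIsExecuting,
                        int preRescaleSlots, int sufficientSlots, int desiredSlots) {
        RescaleStatus initial =
                schedulerIsExecuting ? RescaleStatus.PENDING : RescaleStatus.IGNORED;
        RescaleEvent event = new RescaleEvent(
                trigger, preRescaleSlots, sufficientSlots, desiredSlots, initial);
        events.add(event);
        return event;
    }

    List<RescaleEvent> events() { return Collections.unmodifiableList(events); }
}
```

With something along these lines, an ignored request still shows up in the
history with its trigger and the resources that were available at that time,
so nothing needs to be encoded in a free-text comment field.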