Re: [DISCUSS] FLIP-495: Support AdaptiveScheduler record and query the rescale history

Matthias Pohl Thu, 02 Jan 2025 09:16:36 -0800

Thanks Yuepeng for your response. I added my comments to the individual
paragraphs below:



> Thank you very much for the reminder. The proposal makes sense to me.
> Additionally, I'd like to confirm whether each rescale cycle/event
> requires a status field, such as FAILED, IGNORED, SUCCESS, PENDING, etc. If
> such state fields are not needed, how do we record that a particular
> rescale request was ignored? Or do we not care about this situation and
> only plan to record successful rescale events?
>

Ignored events (due to the AdaptiveScheduler not being in Executing state)
can still be collected by the AdaptiveScheduler in
updateJobResourceRequirements [1]. That's where the rescale history could
be populated.

Also keep in mind that it's not only the JobResourceRequirements update
that can trigger a rescale. Updating the available resources through
newResourcesAvailable [2] can also initiate a rescale. I don't see this
being clearly laid out in FLIP-495, yet.

[1]
https://github.com/apache/flink/blob/a5258a015d553196d34e082e75ca4ae916addbaf/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java#L1115

[2]
https://github.com/apache/flink/blob/fc0ccf325527c3589d5cd5ae7397b22c22321cec/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/ResourceListener.java#L28
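
To illustrate the point, here is a rough sketch of a history that records
both entry points, including ignored requests. All type and method names
are hypothetical (nothing here exists in Flink or in FLIP-495), and the
"ignored if not Executing" rule is a deliberate simplification:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; these types are not part of Flink or FLIP-495.
final class RescaleHistorySketch {

    // The two entry points mentioned above that can initiate a rescale.
    enum Trigger { UPDATE_JOB_RESOURCE_REQUIREMENTS, NEW_RESOURCES_AVAILABLE }

    static final class Entry {
        final Trigger trigger;
        final boolean ignored;

        Entry(Trigger trigger, boolean ignored) {
            this.trigger = trigger;
            this.ignored = ignored;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // 'executing' stands in for "the scheduler is in the Executing state".
    // Simplification: any trigger outside of Executing is recorded as ignored.
    void onTrigger(Trigger trigger, boolean executing) {
        entries.add(new Entry(trigger, !executing));
    }

    List<Entry> entries() {
        return Collections.unmodifiableList(entries);
    }
}
```

The point is only that the history is populated at the trigger site, so an
ignored request still leaves an entry behind.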

> Yes, this could happen during the reserving slots phase, and it's important
> to record this. In my limited read, this is feasible. We can collect the
> exceptions from this phase while gathering the scheduler state history, and
> record the specific information using the previously mentioned exception
> field and comments field, or use failed status as the final status of the
> rescale event/record, WDYTA?
>

I struggle to follow why we put this in a comment and/or exception field.
Shouldn't we have the information provided in structured fields for each
event? Something like:
- pre-rescale resources
- on-rescale-trigger sufficient resources
- on-rescale-trigger desired resources
- actual resources used when rescaling finished (i.e. reaching Executing
state)
That would give a clear summary of what was available when deciding to
rescale and what was used in the end. WDYT?
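
As a rough sketch of what I mean by structured fields, assuming resources
can be summarized as "vertex ID -> parallelism" (that representation and
all names below are hypothetical, not taken from Flink or the FLIP):

```java
import java.util.Map;

// Hypothetical sketch: none of these names exist in Flink or in FLIP-495.
// Resources are modelled as "vertex ID -> parallelism" for illustration only.
final class RescaleEventSummary {
    final Map<String, Integer> preRescale;          // before the rescale
    final Map<String, Integer> sufficientOnTrigger; // sufficient, at trigger time
    final Map<String, Integer> desiredOnTrigger;    // desired, at trigger time
    final Map<String, Integer> actualOnExecuting;   // used once Executing is reached

    RescaleEventSummary(
            Map<String, Integer> preRescale,
            Map<String, Integer> sufficientOnTrigger,
            Map<String, Integer> desiredOnTrigger,
            Map<String, Integer> actualOnExecuting) {
        this.preRescale = preRescale;
        this.sufficientOnTrigger = sufficientOnTrigger;
        this.desiredOnTrigger = desiredOnTrigger;
        this.actualOnExecuting = actualOnExecuting;
    }

    // True if the rescale ended with exactly the resources that were desired.
    boolean reachedDesired() {
        return desiredOnTrigger.equals(actualOnExecuting);
    }
}
```

With fields like these, questions such as "did we end up with the desired
parallelism?" can be answered without parsing a comment string.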


> > I feel like on disk approach (analogously to the exception
> >history) makes the most sense here. WDYT?
>
> Sorry, Matthias. IIUC, if the storage mechanism here is similar to that of
> the exception history, then we should choose the DFS approach, such as
> HDFS. Please correct me if I’m wrong.
>

I agree that we might want to have this information stored in DFS.
That way the information would survive a JobManager failover. It would
still be nice to have all three options reflected in the FLIP, though,
with the pros and cons of each properly documented.

As a side note: There is also FLIP-360 [3] which proposes merging the
ExecutionGraphInfoStore and the JobResultStore into a single component.
That would give us a single store for any completed job that could include
the job's result, its exception history and the rescale history. No need to
rely on the HistoryServer anymore. That's out of scope for FLIP-495,
though. Still, implementing the JM-local-disk approach and working on
FLIP-360 would be another option.

[3]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore


> BTW, the subsequent FLIP content will be maintained in the wiki page, and
> the version in Google Docs will be deprecated.
>

A few other items I'd like to point out are the following ones:
- Is the "Rescale Event" section still outdated or do we have a different
understanding of what a rescale operation is? Based on what I pointed out
in my previous post, I would think that a rescale operation has its
starting point in the AdaptiveScheduler's Executing state (i.e. when the
job is running). That's how it is implemented right now. The "Rescale
Event" section always starts from WaitingForResources, though. Do we
disagree here?

- Rescale status: I would say that this section needs to be reworked. For
instance:
  - Why is a rescale having the STARTING status when the scheduler is in
StopWithSavepoint state? We shouldn't rescale if the user wants to stop the
job? Right now, rescale events are ignored in StopWithSavepoint.
  - The section seems to be outdated. I already pointed out in my previous
message that the rescale operation doesn't touch WaitingForResources but
goes from Restarting straight to CreatingExecutionGraph.
  - Can the content of this section be visualized in a proper control flow
diagram instead? That might make the goal of this section easier to
understand.

- The Rescale ID section:
  - You're talking about resourceRequirementsEpochID, rescale ID and
rescale attempt here: The resourceRequirementsEpochID is used as a general
UUID, the attempt ID is a monotonically increasing number in the scope of a
single resource requirements update. And the rescale ID is a globally
monotonically increasing ID? Can you elaborate on the purpose of each of
the IDs? Why do we need two globally-scoped IDs here?
  - Can you document what an attempt is? Is a failed attempt defined by the
outcome of the rescale operation (i.e. the desired parallelism isn't
reached)?
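
To make the question concrete, this is how I currently read the three IDs.
It is only a sketch of my understanding and may well be wrong; the class
and method names are made up for illustration:

```java
import java.util.UUID;

// Hypothetical sketch of one reading of the three IDs; not Flink code.
final class RescaleIdsSketch {

    static final class Ids {
        final UUID resourceRequirementsEpochId; // one per requirements update
        final long attemptId;                   // increasing within one epoch
        final long rescaleId;                   // increasing globally

        Ids(UUID resourceRequirementsEpochId, long attemptId, long rescaleId) {
            this.resourceRequirementsEpochId = resourceRequirementsEpochId;
            this.attemptId = attemptId;
            this.rescaleId = rescaleId;
        }
    }

    private UUID currentEpoch = UUID.randomUUID();
    private long nextAttemptId = 0;
    private long nextRescaleId = 0;

    // A new resource requirements update starts a new epoch and
    // resets the attempt counter; the rescale counter keeps growing.
    void onResourceRequirementsUpdate() {
        currentEpoch = UUID.randomUUID();
        nextAttemptId = 0;
    }

    Ids nextRescaleAttempt() {
        return new Ids(currentEpoch, nextAttemptId++, nextRescaleId++);
    }
}
```

In this reading, both the epoch ID and the rescale ID are globally scoped,
which is why I'm asking whether we need both.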

Generally, it might help to add more diagrams (especially for documenting
state machines). That might be easier to understand than plain text.

I'm looking forward to your response.

Best,
Matthias
