[
https://issues.apache.org/jira/browse/FLINK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314927#comment-17314927
]
Matthias edited comment on FLINK-21439 at 4/5/21, 3:27 PM:
-----------------------------------------------------------
Hi John,
thank you for your proposal. You correctly identified
{{SchedulerNG.handleGlobalFailure}} and
{{SchedulerNG.updateTaskExecutionState}} as the entry points for failure
handling. This also applies to the {{AdaptiveScheduler}}.
Regarding the proposal to use the {{ExceptionHistoryEntry}}'s static {{from*}}
factory methods: Some work done as part of FLINK-21189 was recently merged. I
should have pinged you on that one. Sorry for that. A new class
{{ExceptionHistoryEntryExtractor}} was introduced that deals with collecting
all relevant information from the {{ExecutionGraph}} to create
{{RootExceptionHistoryEntry}} instances. This enables us to handle failures
that were caught while another failure was already being handled.
The {{AdaptiveScheduler}} only deals with global failovers for now (see the
corresponding
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler]),
i.e. all failures are global failures ([~rmetzger] please correct me if I'm
wrong here). Concurrent failures can still happen, though. These failures are
"swallowed" in the
[Restarting|https://github.com/apache/flink/blob/ca968d305a99b63162136589e1d9f6ba4c9cdd2b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Restarting.java#L78-L86]
state. We might want to collect these failures and add them to the corresponding
{{RootExceptionHistoryEntry}}.
Having the {{BoundedFIFOQueue}} in the {{AdaptiveScheduler}} class makes sense
to me. But there needs to be a way for the {{State}} implementation to populate
that collection.
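To illustrate the last two points, here is a minimal, self-contained Java
sketch. It is not Flink code: {{BoundedHistory}}, {{FailureArchiver}},
{{HistoryEntry}}, {{RestartingSketch}} and {{SchedulerSketch}} are hypothetical
names that only stand in for the {{BoundedFIFOQueue}},
{{RootExceptionHistoryEntry}}, {{Restarting}} state and {{AdaptiveScheduler}}
mentioned above. The assumption is that the scheduler owns the bounded history
and hands each state a narrow callback for populating it, and that the
restarting state buffers concurrent failures instead of dropping them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical bounded FIFO history; a simplified stand-in, not Flink's BoundedFIFOQueue. */
final class BoundedHistory<T> {
    private final int maxSize;
    private final Queue<T> elements = new ArrayDeque<>();

    BoundedHistory(int maxSize) {
        if (maxSize < 0) {
            throw new IllegalArgumentException("maxSize must not be negative");
        }
        this.maxSize = maxSize;
    }

    /** Adds an element and evicts the oldest one once the bound is reached. */
    void add(T element) {
        if (maxSize == 0) {
            return;
        }
        if (elements.size() == maxSize) {
            elements.poll();
        }
        elements.add(element);
    }

    List<T> toList() {
        return new ArrayList<>(elements);
    }
}

/** Hypothetical entry: the root cause plus failures that arrived while it was handled. */
final class HistoryEntry {
    final Throwable rootCause;
    final List<Throwable> concurrentFailures;

    HistoryEntry(Throwable rootCause, List<Throwable> concurrentFailures) {
        this.rootCause = rootCause;
        this.concurrentFailures = new ArrayList<>(concurrentFailures);
    }
}

/** Hypothetical callback through which a state populates the scheduler-owned history. */
interface FailureArchiver {
    void archiveFailure(Throwable rootCause, List<Throwable> concurrentFailures);
}

/** Hypothetical state that buffers failures arriving during a restart. */
final class RestartingSketch {
    private final FailureArchiver archiver;
    private final Throwable rootCause;
    private final List<Throwable> concurrentFailures = new ArrayList<>();

    RestartingSketch(FailureArchiver archiver, Throwable rootCause) {
        this.archiver = archiver;
        this.rootCause = rootCause;
    }

    /** Collects a concurrent failure instead of swallowing it. */
    void handleGlobalFailure(Throwable cause) {
        concurrentFailures.add(cause);
    }

    /** Hands the root cause and all buffered failures to the scheduler. */
    void onRestartComplete() {
        archiver.archiveFailure(rootCause, concurrentFailures);
    }
}

/** Hypothetical scheduler that owns the bounded exception history. */
final class SchedulerSketch implements FailureArchiver {
    private final BoundedHistory<HistoryEntry> exceptionHistory;

    SchedulerSketch(int maxHistorySize) {
        this.exceptionHistory = new BoundedHistory<>(maxHistorySize);
    }

    @Override
    public void archiveFailure(Throwable rootCause, List<Throwable> concurrentFailures) {
        exceptionHistory.add(new HistoryEntry(rootCause, concurrentFailures));
    }

    List<HistoryEntry> getExceptionHistory() {
        return exceptionHistory.toList();
    }
}
```

With this shape the state never sees the queue itself, only the callback, so
the bound and the eviction policy stay in one place. Whether the real
{{State}} implementations should receive such a callback through their context
or through some other mechanism is left open here.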
Does this make sense to you?
was (Author: mapohl):
Hi John,
thank you for your proposal. You correctly identified
{{SchedulerNG.handleGlobalFailure}} or {{SchedulerNG.updateTaskExecutionState}}
as the entry points for failure handling. This also applies to the
{{AdaptiveScheduler}}.
Regarding the proposal to use the {{ExceptionHistoryEntry}}'s static {{from*}}
factory methods: Some work done as part of FLINK-21189 was recently merged. I
should have pinged you on that one. Sorry for that. A new class
{{ExceptionHistoryEntryExtractor}} was introduced that deals with collecting
all relevant information from the {{ExecutionGraph}} to create
{{RootExceptionHistoryEntry}} instances. This enables us to handle failures
that were caught while another failure was already being handled.
The {{AdaptiveScheduler}} only deals with global failovers for now (see the
corresponding
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler]),
i.e. all failures are global failures ([~rmetzger] please correct me if I'm
wrong here). Concurrent failures can still happen, though. These failures are
"swallowed" in the
[Restarting|https://github.com/apache/flink/blob/ca968d305a99b63162136589e1d9f6ba4c9cdd2b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Restarting.java#L78-L86]
state. We might want to collect these failures and add them to the corresponding
{{RootExceptionHistoryEntry}}.
Having the {{BoundedFIFOQueue}} in the {{AdaptiveScheduler}} class makes sense
to me. But there needs to be a way for the {{State}} implementation to populate
that collection.
Does this make sense to you?
> Add support for exception history
> ---------------------------------
>
> Key: FLINK-21439
> URL: https://issues.apache.org/jira/browse/FLINK-21439
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.13.0
> Reporter: Matthias
> Assignee: John Phelan
> Priority: Major
> Fix For: 1.13.0
>
> Time Spent: 3h
> Remaining Estimate: 0h
>
> {{SchedulerNG.requestJob}} returns an {{ExecutionGraphInfo}} that was
> introduced in FLINK-21188. This {{ExecutionGraphInfo}} holds the information
> about the {{ArchivedExecutionGraph}} and exception history information.
> Currently, it's a list of {{ErrorInfos}}. This might change due to ongoing
> work in FLINK-21190 where we might introduce a wrapper class with more
> information on the failure.
> The goal of this ticket is to implement the exception history for the
> {{AdaptiveScheduler}}, i.e. collecting the exceptions that caused restarts.
> This collection of failures should be forwarded through
> {{SchedulerNG.requestJob}}.