[
https://issues.apache.org/jira/browse/FLINK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322743#comment-17322743
]
Matthias commented on FLINK-21439:
----------------------------------
Hi [~bytesandwich], and sorry for not getting back to you earlier. Last week
was a bit busy. Regarding your questions:
{quote}
What exactly does "global" mean? It seems to often mean that the failure cannot
be associated with a task name. It seems that it often also has a separate
meaning - that a failure will trigger a complete restart rather than one
restricted to a subgraph of the job topology?
{quote}
Yes, global means that the entire {{ExecutionGraph}} gets restarted. In the
context of the {{AdaptiveScheduler}}, a global failover means that the
{{ExecutionGraph}} gets recreated. As already mentioned in a previous comment,
the {{AdaptiveScheduler}} does not support partial/local failovers. In the
context of the {{DefaultScheduler}}, global failovers are triggered by Flink
itself; no task is attached in that case. Local failovers have a root cause
that was caught in a task. Depending on the restart strategy used, a local
failover could mean that only one, a few, or all executions of the
{{ExecutionGraph}} need to be restarted.
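To make the distinction concrete, here is a minimal sketch. All types in it ({{FailureReport}}, {{FailoverScope}}, {{FailoverClassifier}}) are hypothetical stand-ins and not Flink classes; the {{supportsLocalFailover}} flag is an assumption that models the difference between the two schedulers.

```java
import java.util.Optional;

/** Hypothetical failure report; NOT a Flink class. */
final class FailureReport {
    private final Throwable cause;
    private final String taskName; // null when no task is attached to the failure

    FailureReport(Throwable cause, String taskName) {
        this.cause = cause;
        this.taskName = taskName;
    }

    Throwable cause() {
        return cause;
    }

    Optional<String> taskName() {
        return Optional.ofNullable(taskName);
    }
}

/** Hypothetical scope of a failover. */
enum FailoverScope { GLOBAL, LOCAL }

final class FailoverClassifier {
    private FailoverClassifier() {}

    /**
     * A failure is handled globally if the scheduler cannot do local failovers
     * (assumed to be the AdaptiveScheduler case) or if no task is attached.
     */
    static FailoverScope classify(FailureReport report, boolean supportsLocalFailover) {
        if (!supportsLocalFailover || !report.taskName().isPresent()) {
            return FailoverScope.GLOBAL;
        }
        // LOCAL only says the root cause was caught in a task; how many
        // executions get restarted is still up to the strategy in use.
        return FailoverScope.LOCAL;
    }
}
```

For example, {{FailoverClassifier.classify(new FailureReport(cause, null), true)}} yields {{GLOBAL}}, while the same call with a task name yields {{LOCAL}}.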
{quote}
Maybe the AdaptiveScheduler can still emit failures associated with their task
names for users to see, even if it will not yet trigger non-global restarts. It
appears updateTaskExecutionState will be passed errors where it's possible to
collect the necessary information, judging from the
testFailureReportedViaUpdateTaskExecutionStateCausesRestart test.
To your point of populating the collection/queue: both the states and the
scheduler have access to the relevant methods. Like you said, many of the
states do not yet seem to fully handle the exceptions users would likely want
to see in the GUI etc., and the states are likely to evolve and change.
Maybe collecting exceptions in the scheduler would let us expose visibility
regardless of the states' behavior, and it would also be coherent with the
queue in the scheduler. From similar code in DefaultScheduler it seems feasible
to do something like the following in AdaptiveScheduler to get the necessary
inputs for ExceptionHistoryEntryExtractor::extractLocalFailure and then
analogously for handleGlobalFailure:
{quote}
For me, collecting the data in the {{AdaptiveScheduler}} (analogous to the
{{DefaultScheduler}}) makes sense.
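As a rough illustration of what scheduler-side collection could look like, here is a self-contained sketch. {{HistoryEntry}} and {{ExceptionHistoryCollector}} are hypothetical names, not Flink classes, and bounding the history to a fixed size is an assumption made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical history entry; NOT a Flink class. */
final class HistoryEntry {
    final Throwable cause;
    final long timestamp;
    final String taskName; // null for global failures (no task attached)

    HistoryEntry(Throwable cause, long timestamp, String taskName) {
        this.cause = cause;
        this.timestamp = timestamp;
        this.taskName = taskName;
    }
}

/** Hypothetical collector owned by the scheduler, independent of its states. */
final class ExceptionHistoryCollector {
    private final int maxSize; // assumption: the history is bounded
    private final Deque<HistoryEntry> entries = new ArrayDeque<>();

    ExceptionHistoryCollector(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Records a failure whose root cause was caught in a task. */
    void recordLocalFailure(Throwable cause, long timestamp, String taskName) {
        add(new HistoryEntry(cause, timestamp, taskName));
    }

    /** Records a failure that has no task attached. */
    void recordGlobalFailure(Throwable cause, long timestamp) {
        add(new HistoryEntry(cause, timestamp, null));
    }

    private void add(HistoryEntry entry) {
        if (entries.size() == maxSize) {
            entries.removeFirst(); // drop the oldest entry to stay within the bound
        }
        entries.addLast(entry);
    }

    /** Returns a copy ordered from oldest to newest. */
    List<HistoryEntry> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

The scheduler would call the two {{record...}} methods from its failure-handling paths and hand out {{snapshot()}} when the job information is requested, so the history does not depend on how each state treats the exception.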
{quote}
Regarding the comment in that code in the conditional early-return block: what
is the right information to tell that a transition isn't a failure that should
or can be collected? Is it adequate to check whether the error is non-null in
the TaskExecutionStateTransition?
{quote}
An {{Execution}} has failed when it is in the {{FAILED}} state. Naturally,
there should be an invariant that a {{Throwable}} is always attached as well.
But there's a bug in the {{Task}} implementation (see
[FLINK-21376|https://issues.apache.org/jira/browse/FLINK-21376]) that might
lead to the error cause (i.e. the {{Throwable}}) not being around. So the state
is the thing to check, not whether the error is non-null. We have a workaround
for the missing cause right now in {{ErrorInfo}} but would like to move it out
into the {{Task}} implementation (see
[FLINK-22060|https://issues.apache.org/jira/browse/FLINK-22060]).
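A small sketch of that idea, under stated assumptions: {{TaskState}} and {{FailureTransitions}} are simplified, hypothetical stand-ins (not the Flink classes), and the placeholder exception only illustrates the kind of workaround described, not the actual {{ErrorInfo}} code.

```java
/** Simplified, hypothetical stand-in for an execution's state. */
enum TaskState { RUNNING, FINISHED, CANCELED, FAILED }

final class FailureTransitions {
    static final String UNKNOWN_CAUSE_MESSAGE = "Unknown cause for Execution failure";

    private FailureTransitions() {}

    /** A transition counts as a failure based on the state, not on the error being set. */
    static boolean isFailure(TaskState newState) {
        return newState == TaskState.FAILED;
    }

    /**
     * Returns the reported cause, or a placeholder if none was attached
     * (which can happen due to the bug mentioned above).
     */
    static Throwable causeOrPlaceholder(Throwable reported) {
        return reported != null ? reported : new IllegalStateException(UNKNOWN_CAUSE_MESSAGE);
    }
}
```

With this, a collector can skip every transition for which {{isFailure}} is false and still record a meaningful entry when the state is {{FAILED}} but the cause is missing.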
Maybe it would make sense to create a draft PR and discuss the code there if
you already have some implementation. The GitHub UI might be more convenient
for discussing code than Jira.
> Add support for exception history
> ---------------------------------
>
> Key: FLINK-21439
> URL: https://issues.apache.org/jira/browse/FLINK-21439
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.13.0
> Reporter: Matthias
> Assignee: John Phelan
> Priority: Major
> Fix For: 1.13.0
>
> Time Spent: 3h
> Remaining Estimate: 0h
>
> {{SchedulerNG.requestJob}} returns an {{ExecutionGraphInfo}} that was
> introduced in FLINK-21188. This {{ExecutionGraphInfo}} holds the information
> about the {{ArchivedExecutionGraph}} and exception history information.
> Currently, it's a list of {{ErrorInfos}}. This might change due to ongoing
> work in FLINK-21190 where we might introduce a wrapper class with more
> information on the failure.
> The goal of this ticket is to implement the exception history for the
> {{AdaptiveScheduler}}, i.e. collecting the exceptions that caused restarts.
> This collection of failures should be forwarded through
> {{SchedulerNG.requestJob}}.