[ 
https://issues.apache.org/jira/browse/FLINK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322743#comment-17322743
 ] 

Matthias commented on FLINK-21439:
----------------------------------

Hi [~bytesandwich] and sorry for not getting back to you earlier. The last week 
was a bit busy. About your questions:
{quote}
What exactly does "global" mean? It seems to often mean that the failure cannot 
be associated with a task name. It seems that it often also has a separate 
meaning - that a failure will trigger a complete restart rather than one 
restricted to a subgraph of the job topology?
{quote}
Yes, global means that the entire {{ExecutionGraph}} gets restarted. In the context 
of the {{AdaptiveScheduler}}, a global failover means that the {{ExecutionGraph}} 
gets recreated. The {{AdaptiveScheduler}} does not support partial/local 
failovers, as already mentioned in a previous comment. In the context of the 
{{DefaultScheduler}}, global failovers are triggered by Flink itself; no task is 
attached in that case. Local failovers have a root cause that was caught in a 
task. Depending on the restart strategy used, a local failover could mean that 
only one, a few, or all executions of the {{ExecutionGraph}} need to be restarted.
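
To make the distinction concrete, here is a minimal sketch in plain Java. It only models the description above for the {{DefaultScheduler}}: a global failure has no task attached, a local one does. {{FailoverSketch}}, {{FailureRecord}} and all of its methods are hypothetical names invented for this illustration; they are not Flink classes.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical illustration only: none of these types exist in Flink.
final class FailoverSketch {

    /** A failure that may or may not be attributable to a task. */
    static final class FailureRecord {
        private final Throwable cause;
        private final String taskName; // null when no task is attached

        private FailureRecord(Throwable cause, String taskName) {
            this.cause = Objects.requireNonNull(cause);
            this.taskName = taskName;
        }

        /** Failure raised outside of any task, so no task name is known. */
        static FailureRecord global(Throwable cause) {
            return new FailureRecord(cause, null);
        }

        /** Failure whose root cause was caught in a specific task. */
        static FailureRecord local(Throwable cause, String taskName) {
            return new FailureRecord(cause, Objects.requireNonNull(taskName));
        }

        boolean isGlobal() {
            return taskName == null;
        }

        Optional<String> getTaskName() {
            return Optional.ofNullable(taskName);
        }

        Throwable getCause() {
            return cause;
        }
    }

    public static void main(String[] args) {
        FailureRecord global = FailureRecord.global(new RuntimeException("no task attached"));
        FailureRecord local = FailureRecord.local(new RuntimeException("user code failed"), "map-task");

        System.out.println("global: isGlobal=" + global.isGlobal());
        System.out.println("local: isGlobal=" + local.isGlobal()
                + ", task=" + local.getTaskName().orElse("n/a"));
    }
}
```

Note that in the {{AdaptiveScheduler}} a failure with a task name attached would still lead to a full restart; the task name would only serve as information for the user.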

{quote}
Maybe the AdaptiveScheduler can still emit failures associated with their task 
names for users to see, even if it will not yet trigger non global restarts. It 
appears updateTaskExecutionState will be passed errors where it's possible to 
collect the necessary information, from this 
testFailureReportedViaUpdateTaskExecutionStateCausesRestart test

To your point about populating the collection/queue: both the states and the 
scheduler have access to the relevant methods. As you said, many of the 
states do not yet seem to have complete handling of the exceptions users would 
likely want to see in the GUI etc., and the states are likely to evolve and 
change.

Maybe collecting exceptions in the scheduler would let us expose visibility 
regardless of state's behavior and it will also be coherent with the queue in 
the scheduler. From similar code from DefaultScheduler it seems feasible to do 
something like the following in AdaptiveScheduler to get the necessary inputs 
for ExceptionHistoryEntryExtractor::extractLocalFailure and then analogously 
for handleGlobalFailure:
{quote}
For me, collecting the data in the {{AdaptiveScheduler}} (analogous to the 
{{DefaultScheduler}}) makes sense.
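
As a rough sketch of what "collecting in the scheduler" could look like, the following plain-Java class keeps a bounded list of failures and hands out a snapshot on request. {{ExceptionHistorySketch}}, its methods and the size limit are assumptions made up for this example; the sketch does not show how either scheduler actually stores its history.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only: not a Flink class.
final class ExceptionHistorySketch {

    private final int maxSize;
    private final Deque<Throwable> entries = new ArrayDeque<>();

    ExceptionHistorySketch(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Records a failure, evicting the oldest entry once the limit is reached. */
    void add(Throwable failure) {
        Objects.requireNonNull(failure);
        if (entries.size() == maxSize) {
            entries.removeFirst();
        }
        entries.addLast(failure);
    }

    /** Returns a copy (oldest first) so callers cannot mutate the internal state. */
    List<Throwable> snapshot() {
        return new ArrayList<>(entries);
    }

    public static void main(String[] args) {
        ExceptionHistorySketch history = new ExceptionHistorySketch(2);
        history.add(new RuntimeException("first"));
        history.add(new RuntimeException("second"));
        history.add(new RuntimeException("third")); // evicts "first"
        for (Throwable t : history.snapshot()) {
            System.out.println(t.getMessage());
        }
    }
}
```

Keeping the collection in one place means the individual states only need to report failures and do not have to know how the history is stored or bounded.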

{quote}
Regarding the comment in that code in the conditional early return block, what 
is the right information to say a transition isn't a failure that should or can 
be collected? Is it adequate to check whether the error is nonnull in the 
TaskExecutionStateTransition?
{quote}
An {{Execution}} has failed when it is in the {{FAILED}} state. Naturally, there 
should be an invariant that a {{Throwable}} is always attached as well. But 
there's a bug in the {{Task}} implementation (see 
[FLINK-21376|https://issues.apache.org/jira/browse/FLINK-21376]) that might 
lead to the error cause (i.e. the {{Throwable}}) not being around. We have a 
workaround for this right now in {{ErrorInfo}} but would like to move it out 
into the {{Task}} implementation (see 
[FLINK-22060|https://issues.apache.org/jira/browse/FLINK-22060]).
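
In other words, the check should be based on the state rather than on the error being non-null, with a fallback for a missing cause. A minimal sketch of that idea in plain Java follows. {{State}}, {{Transition}} and {{extractFailure}} are hypothetical stand-ins for this example and do not mirror the actual Flink signatures; the fallback exception and its message are assumptions as well.

```java
import java.util.Optional;

// Hypothetical illustration only: simplified stand-ins, not Flink classes.
final class FailureExtractionSketch {

    enum State { RUNNING, FINISHED, CANCELED, FAILED }

    /** Simplified stand-in for a task execution state transition. */
    static final class Transition {
        final State state;
        final Throwable error; // may be null, even for FAILED

        Transition(State state, Throwable error) {
            this.state = state;
            this.error = error;
        }
    }

    /**
     * Returns the failure to record, or empty if the transition is not a failure.
     * The decision is made on the state; a missing cause is replaced by a placeholder.
     */
    static Optional<Throwable> extractFailure(Transition transition) {
        if (transition.state != State.FAILED) {
            return Optional.empty(); // not a failure, nothing to collect
        }
        Throwable cause = transition.error != null
                ? transition.error
                : new IllegalStateException("Unknown cause for execution failure");
        return Optional.of(cause);
    }

    public static void main(String[] args) {
        System.out.println(extractFailure(new Transition(State.RUNNING, null)).isPresent());
        System.out.println(extractFailure(new Transition(State.FAILED, null)).isPresent());
    }
}
```

Checking only for a non-null error would silently drop failures that hit the missing-cause bug, which is why the sketch branches on the state first.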

Maybe it would make sense to create a draft PR and discuss the code there if 
you already have some implementation. The GitHub UI might be more convenient 
for discussing code than Jira.

> Add support for exception history
> ---------------------------------
>
>                 Key: FLINK-21439
>                 URL: https://issues.apache.org/jira/browse/FLINK-21439
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Matthias
>            Assignee: John Phelan
>            Priority: Major
>             Fix For: 1.13.0
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> {{SchedulerNG.requestJob}} returns an {{ExecutionGraphInfo}} that was 
> introduced in FLINK-21188. This {{ExecutionGraphInfo}} holds the information 
> about the {{ArchivedExecutionGraph}} and exception history information. 
> Currently, it's a list of {{ErrorInfos}}. This might change due to ongoing 
> work in FLINK-21190 where we might introduce a wrapper class with more 
> information on the failure.
> The goal of this ticket is to implement the exception history for the 
> {{AdaptiveScheduler}}, i.e. collecting the exceptions that caused restarts. 
> This collection of failures should be forwarded through 
> {{SchedulerNG.requestJob}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
