[ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264894#comment-17264894
 ] 

Matthias commented on FLINK-6042:
---------------------------------

[~trohrmann] I'd suggest collecting the exceptions and the timestamps of their 
occurrence (using {{ErrorInfo}}) in the {{SchedulerNG}} implementations. For 
{{SchedulerBase}} we could use the 
[UpdateSchedulerNgOnInternalFailuresListener|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L605]
 to collect the exceptions. 
[ArchivedExecutionGraph.createFrom(..)|https://github.com/apache/flink/blob/ac968b83675e64712b4d35dbc166e09808c2156b/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L800]'s
 interface and 
[ArchivedExecutionGraph|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecutionGraph.java#L98]
 (which implements {{AccessExecutionGraph}}) need to be extended to also cover 
the collection of exceptions.
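
As a rough sketch of the collection side: the snippet below keeps the most recent failures together with their timestamps in a bounded buffer. {{FailureRecord}} and {{FailureHistory}} are hypothetical names used only for illustration; {{FailureRecord}} is a simplified stand-in for {{ErrorInfo}}, and none of this is existing Flink code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified stand-in for ErrorInfo: a failure cause plus its timestamp.
final class FailureRecord {
    final Throwable cause;
    final long timestamp;

    FailureRecord(Throwable cause, long timestamp) {
        this.cause = cause;
        this.timestamp = timestamp;
    }
}

// Hypothetical bounded history that a scheduler could feed from its failure listener.
final class FailureHistory {
    private final int maxSize;
    private final Deque<FailureRecord> records = new ArrayDeque<>();

    FailureHistory(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    // Records a failure; the oldest entry is evicted once the limit is reached.
    void record(Throwable cause, long timestamp) {
        if (records.size() == maxSize) {
            records.removeFirst();
        }
        records.addLast(new FailureRecord(cause, timestamp));
    }

    // Returns a copy ordered from oldest to newest, safe to hand to an archived graph.
    List<FailureRecord> snapshot() {
        return new ArrayList<>(records);
    }
}
```

The snapshot copy is what would be passed along when the archived graph is created, so later failures do not change an already archived view.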

We will have to extend {{AccessExecutionGraph}} to provide a method for 
returning the collected {{ErrorInfo}} instances. This newly introduced method 
can be used in 
[JobExceptionsHandler.createJobExceptionsInfo(..)|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobExceptionsHandler.java#L99].
 The newly collected exceptions can be added to {{JobExceptionsInfo}}'s 
{{allExceptions}} field.
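
To illustrate the read path, here is a minimal sketch under simplifying assumptions. {{FailureHistoryView}}, {{RecordedFailure}}, {{ExceptionEntry}} and {{ExceptionsResponseBuilder}} are hypothetical names: the first stands in for the proposed accessor on {{AccessExecutionGraph}}, and {{ExceptionEntry}} is a reduced stand-in for an entry of {{allExceptions}}. The real classes carry more fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for ErrorInfo.
final class RecordedFailure {
    final String stackTrace;
    final long timestamp;

    RecordedFailure(String stackTrace, long timestamp) {
        this.stackTrace = stackTrace;
        this.timestamp = timestamp;
    }
}

// Hypothetical read-only accessor, standing in for the method proposed for AccessExecutionGraph.
interface FailureHistoryView {
    List<RecordedFailure> getFailureHistory();
}

// Hypothetical, reduced stand-in for one entry of the allExceptions list.
final class ExceptionEntry {
    final String exception;
    final long timestamp;

    ExceptionEntry(String exception, long timestamp) {
        this.exception = exception;
        this.timestamp = timestamp;
    }
}

final class ExceptionsResponseBuilder {
    // Appends the collected failures after the task exceptions that are already reported.
    static List<ExceptionEntry> build(List<ExceptionEntry> taskExceptions, FailureHistoryView graph) {
        List<ExceptionEntry> all = new ArrayList<>(taskExceptions);
        for (RecordedFailure failure : graph.getFailureHistory()) {
            all.add(new ExceptionEntry(failure.stackTrace, failure.timestamp));
        }
        return all;
    }
}
```

Note that this sketch simply appends the collected failures to the same list, so both kinds of exceptions end up mixed in one collection.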

Is this valid, given that the exceptions would be mixed with the task exceptions 
that are already exposed in the current version of the UI? Or should the user be 
able to distinguish these exceptions?

> Display last n exceptions/causes for job restarts in Web UI
> -----------------------------------------------------------
>
>                 Key: FLINK-6042
>                 URL: https://issues.apache.org/jira/browse/FLINK-6042
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Web Frontend
>    Affects Versions: 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Matthias
>            Priority: Major
>
> Users requested that it would be nice to see the last {{n}} exceptions 
> causing a job restart in the Web UI. This will help to more easily debug and 
> operate a job.
> We could store the root causes for failures similar to how prior executions 
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and 
> then serve this information via the Web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
