[
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092
]
Matthias commented on FLINK-6042:
---------------------------------
We have two approaches (which we discussed offline) to implement this feature:
# The {{JobExceptionsHandler}} does most of the work by iterating over the
{{ArchivedExecution}}s of the passed {{ArchivedExecutionGraph}}.
{{ArchivedExecution}}s provide the time (through
{{ArchivedExecution.stateTimestamps}}) and the thrown exception
({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would
need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and
pass it over to the {{JobExceptionsHandler}} along with the
{{ArchivedExecutionGraph}}. This would enable the handler to group exceptions
that happened due to the same failure cause.
+Pros:+
- This approach has the advantage of using mostly code that is already there.
- No extra code in the {{SchedulerBase}} implementation.
+Cons:+
- It does not support restarts of the {{ExecutionGraph}}. This restart
functionality is planned for the declarative scheduler which we're currently
working on (see
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]).
Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is
provided.
- There might be modifications necessary to the internally used data structures
allowing random access based on {{ExecutionAttemptID}} instead of iterating
over collections.
# The collection of exceptions happens in the scheduler. The mapping of root
cause to related exceptions is then passed over to the
{{JobExceptionsHandler}}. The exceptions can be collected as they appear.
+Pros:+
- It makes it easier to port this functionality into the declarative
scheduler of FLIP-160. We don't need to think of a history of
{{ArchivedExecutionGraph}}s for now. Restarts of the {{ExecutionGraph}} are
hidden away from the {{JobExceptionsHandler}}.
+Cons:+
- The {{SchedulerBase}} code base grows once more, which increases complexity.
We decided to go with option 2 for now. This makes it easier for us to
implement the functionality into the declarative scheduler of FLIP-160.
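To make option 2 more concrete, here is a rough sketch of what collecting exceptions on the scheduler side could look like. It is only an illustration: {{ExceptionHistorySketch}}, {{FailureEntry}}, {{RootFailure}}, {{History}} and all method names are hypothetical and do not exist in Flink, and the plain {{ArrayDeque}} stands in for whatever bounded structure we end up using. It assumes a root failure opens a new group and later exceptions belonging to the same failure are attached to the most recent group.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch; none of these types exist in Flink under these names. */
final class ExceptionHistorySketch {

    /** One failure: the exception, when it happened, and which task reported it. */
    static final class FailureEntry {
        final Throwable cause;
        final long timestamp;
        final String taskName;

        FailureEntry(Throwable cause, long timestamp, String taskName) {
            this.cause = cause;
            this.timestamp = timestamp;
            this.taskName = taskName;
        }
    }

    /** A root cause plus the exceptions attributed to the same failure. */
    static final class RootFailure {
        final FailureEntry root;
        private final List<FailureEntry> related = new ArrayList<>();

        RootFailure(FailureEntry root) {
            this.root = root;
        }

        List<FailureEntry> related() {
            return Collections.unmodifiableList(related);
        }
    }

    /** Bounded history kept by the scheduler; the oldest root failure is evicted first. */
    static final class History {
        private final int maxSize;
        private final Deque<RootFailure> failures = new ArrayDeque<>();

        History(int maxSize) {
            if (maxSize < 1) {
                throw new IllegalArgumentException("maxSize must be at least 1");
            }
            this.maxSize = maxSize;
        }

        /** Records a new root cause, e.g. the failure that triggered a restart. */
        void addRootFailure(Throwable cause, long timestamp, String taskName) {
            if (failures.size() == maxSize) {
                failures.removeFirst();
            }
            failures.addLast(new RootFailure(new FailureEntry(cause, timestamp, taskName)));
        }

        /** Attaches an exception to the most recently recorded root cause. */
        void addRelatedFailure(Throwable cause, long timestamp, String taskName) {
            RootFailure latest = failures.peekLast();
            if (latest == null) {
                throw new IllegalStateException("No root failure recorded yet");
            }
            latest.related.add(new FailureEntry(cause, timestamp, taskName));
        }

        /** Copy of the history, oldest first, that could be handed to the handler. */
        List<RootFailure> snapshot() {
            return new ArrayList<>(failures);
        }
    }
}
```

With something like this, the scheduler would call {{addRootFailure}} and {{addRelatedFailure}} as exceptions appear, and the {{JobExceptionsHandler}} would only ever see the result of {{snapshot()}}, regardless of how many times the {{ExecutionGraph}} was restarted in between.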
> Display last n exceptions/causes for job restarts in Web UI
> -----------------------------------------------------------
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
> Labels: pull-request-available
>
> Users requested that it would be nice to see the last {{n}} exceptions
> causing a job restart in the Web UI. This will help to more easily debug and
> operate a job.
> We could store the root causes for failures similar to how prior executions
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and
> then serve this information via the Web UI.