[
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092
]
Till Rohrmann edited comment on FLINK-6042 at 1/21/21, 9:17 AM:
----------------------------------------------------------------
We have two approaches (which we discussed offline) to implement this feature:
# The {{JobExceptionsHandler}} does most of the work by iterating over the
{{ArchivedExecutions}} of the passed {{ArchivedExecutionGraph}}.
{{ArchivedExecutions}} provide the time (through
{{ArchivedExecution.stateTimestamps}}) and the thrown exception
({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would
need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and
pass it over to the {{JobExceptionsHandler}} along with the
{{ArchivedExecutionGraph}}. This would enable the handler to group exceptions
that were caused by the same failure.
** +Pros:+
*** This approach has the advantage of using mostly code that is already there.
*** No extra code in the {{SchedulerBase}} implementation.
** +Cons:+
*** It does not support restarts of the {{ExecutionGraph}}. This restart
functionality is planned for the declarative scheduler which we're currently
working on (see
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]).
Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is
provided.
*** There might be modifications necessary to the internally used data
structures allowing random access based on {{ExecutionAttemptID}} instead of
iterating over collections.
# The collection of exceptions happens in the scheduler. The mapping of root
cause to related exceptions is then passed over to the
{{JobExceptionsHandler}}. The exceptions can be collected as they appear.
** +Pros:+
*** It makes it easier to port this functionality into the declarative
scheduler of FLIP-160. We don't need to think of a history of
{{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are
hidden away from the {{JobExceptionsHandler}}.
** +Cons:+
*** The {{SchedulerBase}} code base grows once more, which increases complexity.
We decided to go with option 2 for now. This makes it easier for us to
port the functionality to the declarative scheduler of FLIP-160.
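
To illustrate option 2, here is a minimal sketch of how a scheduler-side component could collect exceptions as they appear and group them under the root cause of each restart. It is not Flink code: {{ExceptionHistory}}, {{Entry}}, {{recordRootCause}}, {{recordRelated}} and {{snapshot}} are hypothetical names chosen for this example, and the bound of {{n}} entries is an assumption borrowed from the issue description.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch (not Flink API): bounded failure history kept by the scheduler. */
final class ExceptionHistory {

    /** One entry per restart: the root cause plus exceptions attributed to the same failure. */
    static final class Entry {
        final Throwable rootCause;
        final long timestamp;
        final List<Throwable> related = new ArrayList<>();

        Entry(Throwable rootCause, long timestamp) {
            this.rootCause = rootCause;
            this.timestamp = timestamp;
        }
    }

    private final int maxEntries;
    private final Deque<Entry> entries = new ArrayDeque<>();

    ExceptionHistory(int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
    }

    /** Records the failure that triggered a restart; evicts the oldest entry once full. */
    Entry recordRootCause(Throwable cause, long timestamp) {
        if (entries.size() == maxEntries) {
            entries.removeFirst();
        }
        Entry entry = new Entry(cause, timestamp);
        entries.addLast(entry);
        return entry;
    }

    /** Attaches an exception to the most recently recorded root cause. */
    void recordRelated(Throwable exception) {
        Entry latest = entries.peekLast();
        if (latest == null) {
            throw new IllegalStateException("no root cause recorded yet");
        }
        latest.related.add(exception);
    }

    /** Read-only copy (oldest first) that could be handed to a REST handler. */
    List<Entry> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(entries));
    }
}
```

In this sketch the handler only ever sees the snapshot, so restarts of the {{ExecutionGraph}} stay hidden from it, which is the property we want from option 2.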
was (Author: mapohl):
> Display last n exceptions/causes for job restarts in Web UI
> -----------------------------------------------------------
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
> Labels: pull-request-available
>
> Users requested the ability to see the last {{n}} exceptions
> that caused a job restart in the Web UI. This would make it easier to debug
> and operate a job.
> We could store the root causes for failures similar to how prior executions
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and
> then serve this information via the Web UI.
> _-- Update: January 21, 2021 --_
> The UI can already handle multiple exceptions through the Exception History.
> Right now, we list one or more exceptions which caused the job to fail.
> Instead, we could adapt it in a way that the history contains not only the
> exceptions of the most recent failure but one expandable entry per restart.
> If there is more than one exception connected to a single restart, we would
> list their stacktraces within one expandable entry.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)