[jira] [Comment Edited] (FLINK-6042) Display last n exceptions/causes for job restarts in Web UI

Till Rohrmann (Jira) Thu, 21 Jan 2021 01:18:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269092#comment-17269092
 ]


Till Rohrmann edited comment on FLINK-6042 at 1/21/21, 9:17 AM:
----------------------------------------------------------------

We have two approaches (which we discussed offline) to implement this feature:
 # The {{JobExceptionsHandler}} does most of the work by iterating over the 
{{ArchivedExecutions}} of the passed {{ArchivedExecutionGraph}}. 
{{ArchivedExecutions}} provide the time (through 
{{ArchivedExecution.stateTimestamps}}) and the thrown exception 
({{ArchivedExecution.failureCause}}). The {{SchedulerNG}} implementation would 
need to collect a mapping of {{failureCause}} to {{ExecutionAttemptID}} and 
pass it over to the {{JobExceptionsHandler}} along with the 
{{ArchivedExecutionGraph}}. This would enable the handler to group exceptions 
that happened due to the same failure cause.
 ** +Pros:+
 *** This approach has the advantage of using mostly code that is already there.
 *** No extra code in the {{SchedulerBase}} implementation.
 ** +Cons:+
 *** It does not support restarts of the {{ExecutionGraph}}. This restart 
functionality is planned for the declarative scheduler which we're currently 
working on (see 
[FLIP-160|https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Declarative+Scheduler]).
 Only the most recent {{ExecutionGraph}} (and, therefore, its exceptions) is 
provided.
 *** The internally used data structures might need to be modified to allow 
random access based on {{ExecutionAttemptID}} instead of iterating over 
collections.
 # The collection of exceptions happens in the scheduler. The mapping of root 
cause to related exceptions is then passed over to the 
{{JobExceptionsHandler}}. The exceptions can be collected as they appear.
 ** +Pros:+
 *** It makes it easier to port this functionality into the declarative 
scheduler of FLIP-160. We don't need to think of a history of 
{{ArchivedExecutionGraphs}} for now. Restarts of the {{ExecutionGraph}} are 
hidden away from the {{JobExceptionsHandler}}.
 ** +Cons:+
 *** The {{SchedulerBase}} code base grows once more, which increases 
complexity.
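
To make option 1 more concrete, here is a minimal sketch of the grouping step the handler would have to perform. All class and method names are hypothetical stand-ins (they are not the Flink classes), and exceptions are reduced to plain strings. It assumes the scheduler hands over a mapping from root cause to the IDs of the affected attempts, and it builds an index by attempt ID first, which is the kind of random access mentioned in the cons.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of option 1; none of these types are Flink API. */
final class HandlerGroupingSketch {

    /** Simplified stand-in for an archived execution attempt. */
    static final class ArchivedAttempt {
        final String attemptId;
        final long failureTimestamp;
        final String failureCause;

        ArchivedAttempt(String attemptId, long failureTimestamp, String failureCause) {
            this.attemptId = attemptId;
            this.failureTimestamp = failureTimestamp;
            this.failureCause = failureCause;
        }
    }

    /**
     * Groups archived attempts by root cause.
     *
     * @param attempts all attempts found by iterating over the archived graph
     * @param attemptIdsByRootCause assumed mapping collected by the scheduler
     */
    static Map<String, List<ArchivedAttempt>> groupByRootCause(
            List<ArchivedAttempt> attempts,
            Map<String, List<String>> attemptIdsByRootCause) {
        // Index by attempt ID so each lookup avoids another full iteration.
        Map<String, ArchivedAttempt> byId = new HashMap<>();
        for (ArchivedAttempt attempt : attempts) {
            byId.put(attempt.attemptId, attempt);
        }

        // Preserve the order in which the scheduler reported the root causes.
        Map<String, List<ArchivedAttempt>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : attemptIdsByRootCause.entrySet()) {
            List<ArchivedAttempt> group = new ArrayList<>();
            for (String attemptId : entry.getValue()) {
                ArchivedAttempt attempt = byId.get(attemptId);
                if (attempt != null) { // unknown IDs are skipped in this sketch
                    group.add(attempt);
                }
            }
            grouped.put(entry.getKey(), group);
        }
        return grouped;
    }
}
```

Because the handler only sees the attempts of the graph that was passed in, this sketch shares the limitation listed above: attempts from an earlier {{ExecutionGraph}} would simply be missing from the result.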

We decided to go with option 2 for now. This makes it easier for us to 
implement the functionality in the declarative scheduler of FLIP-160.
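
As a rough illustration of option 2, the following sketch shows how exceptions could be collected as they appear and grouped under their root cause. The class and method names are hypothetical and are not taken from {{SchedulerBase}} or any other Flink class. It also assumes that the scheduler can tell a root failure apart from the follow-up failures it triggers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of option 2; not the actual scheduler code. */
final class SchedulerExceptionCollector {

    /** One failure: the root cause plus the exceptions that followed from it. */
    static final class FailureEntry {
        final long timestamp;
        final Throwable rootCause;
        final List<Throwable> related = new ArrayList<>();

        FailureEntry(long timestamp, Throwable rootCause) {
            this.timestamp = timestamp;
            this.rootCause = rootCause;
        }
    }

    private final List<FailureEntry> history = new ArrayList<>();
    private FailureEntry current;

    /** Called when a new root failure is detected; opens a new entry. */
    void onRootFailure(long timestamp, Throwable rootCause) {
        current = new FailureEntry(timestamp, rootCause);
        history.add(current);
    }

    /** Called for exceptions caused by the most recent root failure. */
    void onRelatedFailure(Throwable exception) {
        if (current == null) {
            throw new IllegalStateException("No root failure recorded yet");
        }
        current.related.add(exception);
    }

    /** Read-only copy of the list that would be handed to the handler. */
    List<FailureEntry> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(history));
    }
}
```

Since the collector lives outside any single {{ExecutionGraph}}, its history would survive a restart of the graph, and the handler would only ever consume the result of {{snapshot()}}.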



> Display last n exceptions/causes for job restarts in Web UI
> -----------------------------------------------------------
>
>                 Key: FLINK-6042
>                 URL: https://issues.apache.org/jira/browse/FLINK-6042
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Web Frontend
>    Affects Versions: 1.3.0
>            Reporter: Till Rohrmann
>            Assignee: Matthias
>            Priority: Major
>              Labels: pull-request-available
>
> Users requested that it would be nice to see the last {{n}} exceptions 
> causing a job restart in the Web UI. This will help to more easily debug and 
> operate a job.
> We could store the root causes for failures similar to how prior executions 
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and 
> then serve this information via the Web UI.
> _-- Update: January 21, 2021 --_
> The UI can already handle multiple exceptions through the Exception History. 
> Right now, we list one or more exceptions which caused the job to fail. 
> Instead, we could adapt it in a way that the history contains not only the 
> exceptions of the most recent failure but one expandable entry per restart. 
> If there is more than one exception connected to a single restart, we would 
> list their stacktraces within one expandable entry.
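
The "last n" storage mentioned in the description can be sketched as a small bounded history that evicts the oldest entry once it is full. This is a hypothetical helper for illustration only and does not reproduce the API of Flink's {{EvictingBoundedList}}.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded history that keeps only the last n items. */
final class BoundedHistory<T> {

    private final int capacity;
    private final Deque<T> items = new ArrayDeque<>();

    BoundedHistory(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds an item and drops the oldest one if the history is full. */
    void add(T item) {
        if (items.size() == capacity) {
            items.removeFirst();
        }
        items.addLast(item);
    }

    /** Returns the retained items, most recent first, e.g. for display. */
    List<T> newestFirst() {
        List<T> result = new ArrayList<>(items.size());
        items.descendingIterator().forEachRemaining(result::add);
        return result;
    }
}
```

With one item per restart, each item could hold all exceptions connected to that restart, which would map onto the one expandable entry per restart described above.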



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
