[
https://issues.apache.org/jira/browse/FLINK-6042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269499#comment-17269499
]
Matthias edited comment on FLINK-6042 at 1/22/21, 10:47 AM:
------------------------------------------------------------
{quote}Taking your argument, why is it better to add the exception information
method to the {{ArchivedExecutionGraph}} and making it thereby accessible to
all {{AbstractExecutionGraphHandler}} handlers? Wouldn't it make sense to only
provide access to the information a handler needs? In our case, one could
give access to the {{AccessExecutionGraph}} for those handlers which extract
information from the {{ExecutionGraph}} and maybe something like a
{{FailureHistory}} for the {{JobExceptionsHandler}}? In the end the
{{ArchivedExecutionGraph}} might also implement {{FailureHistory}} but I think
the important bit is to segregate the interfaces.
{quote}
Good point: Having a separate interface sounds like the better approach.
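The interface segregation discussed above could look roughly like the following minimal sketch. All type and method names here ({{GraphView}}, {{FailureHistory.failureCauses()}}, {{ArchivedGraph}}, the two handler methods) are hypothetical stand-ins chosen for illustration and are not Flink's actual API; failure causes are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SegregatedHandlerAccess {

    /** Hypothetical read-only graph view (stand-in for AccessExecutionGraph). */
    public interface GraphView {
        String jobName();
    }

    /** Hypothetical narrow interface that exposes failure information only. */
    public interface FailureHistory {
        List<String> failureCauses();
    }

    /** Stand-in for ArchivedExecutionGraph; it may implement both interfaces. */
    public static final class ArchivedGraph implements GraphView, FailureHistory {
        private final String jobName;
        private final List<String> failureCauses;

        public ArchivedGraph(String jobName, List<String> failureCauses) {
            this.jobName = jobName;
            this.failureCauses = Collections.unmodifiableList(new ArrayList<>(failureCauses));
        }

        @Override
        public String jobName() {
            return jobName;
        }

        @Override
        public List<String> failureCauses() {
            return failureCauses;
        }
    }

    /** A graph-based handler only sees the graph view. */
    public static String handleJobDetails(GraphView graph) {
        return "job=" + graph.jobName();
    }

    /** The exceptions handler only sees the failure history. */
    public static String handleJobExceptions(FailureHistory history) {
        return "failures=" + history.failureCauses();
    }

    public static void main(String[] args) {
        ArchivedGraph archived =
                new ArchivedGraph("wordcount", Arrays.asList("TaskManager lost"));
        System.out.println(handleJobDetails(archived));
        System.out.println(handleJobExceptions(archived));
    }
}
```

The same object is passed to both handlers, but each handler's parameter type limits what it can reach, which is the point of segregating the interfaces.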
{quote}Thinking a step ahead, how would it work with the
{{ArchivedExecutionGraph}} if we send multiple graphs because the graph changed
over the job's lifetime? To which graph will the exception that ended a graph's
lifetime be assigned?
{quote}
As we have a list of {{ArchivedExecutionGraphs}} in chronological order, I
would assume that every instance except the last one has a failureCause that
triggered the instantiation of a new {{ExecutionGraph}}. If no failure cause is
given, the instantiation happened due to rescaling (alternatively, we could
think of a new state to make that more explicit?). The most recent
{{ExecutionGraph}} then holds either the failure that caused the job to fail or
no failure cause if the job is in a non-failed state.
But considering that we might want to hand over a list of
{{ArchivedExecutionGraphs}} in the future, it would again be worthwhile to have
a class holding the {{ArchivedExecutionGraph}} (or later a list of
{{ArchivedExecutionGraphs}}) which implements {{FailureHistory}} as well.
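As a rough sketch of the assignment rule described above, a wrapper class could hold the graphs in chronological order and derive one entry per graph. The names {{GraphSnapshot}}, {{GraphHistory}} and {{describeTransitions()}} are assumptions made for this example only; a {{null}} failure cause stands for "no failure cause given".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GraphHistorySketch {

    /** Hypothetical stand-in for ArchivedExecutionGraph; failureCause is null if none was set. */
    public static final class GraphSnapshot {
        final String failureCause;

        public GraphSnapshot(String failureCause) {
            this.failureCause = failureCause;
        }
    }

    /** Hypothetical narrow interface for the exceptions handler. */
    public interface FailureHistory {
        List<String> describeTransitions();
    }

    /** Wrapper holding the graphs in chronological order. */
    public static final class GraphHistory implements FailureHistory {
        private final List<GraphSnapshot> graphs;

        public GraphHistory(List<GraphSnapshot> chronologicalGraphs) {
            this.graphs = new ArrayList<>(chronologicalGraphs);
        }

        @Override
        public List<String> describeTransitions() {
            List<String> result = new ArrayList<>();
            for (int i = 0; i < graphs.size(); i++) {
                boolean isLast = i == graphs.size() - 1;
                String cause = graphs.get(i).failureCause;
                if (cause != null) {
                    // Earlier graphs: the failure triggered a new graph.
                    // Last graph: the failure made the job fail.
                    result.add((isLast ? "job failed: " : "restart after failure: ") + cause);
                } else {
                    // Earlier graphs without a cause are treated as rescaling.
                    result.add(isLast ? "no failure" : "rescaled");
                }
            }
            return Collections.unmodifiableList(result);
        }
    }

    public static void main(String[] args) {
        GraphHistory history = new GraphHistory(Arrays.asList(
                new GraphSnapshot("OutOfMemoryError"),
                new GraphSnapshot(null),
                new GraphSnapshot(null)));
        history.describeTransitions().forEach(System.out::println);
    }
}
```

An explicit state instead of the {{null}} convention, as suggested in the parenthesis above, would remove the need to infer rescaling from a missing cause.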
> Display last n exceptions/causes for job restarts in Web UI
> -----------------------------------------------------------
>
> Key: FLINK-6042
> URL: https://issues.apache.org/jira/browse/FLINK-6042
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Web Frontend
> Affects Versions: 1.3.0
> Reporter: Till Rohrmann
> Assignee: Matthias
> Priority: Major
> Labels: pull-request-available
>
> Users have asked to see the last {{n}} exceptions that caused a job restart
> in the Web UI. This would make it easier to debug and operate a job.
> We could store the root causes for failures similar to how prior executions
> are stored in the {{ExecutionVertex}} using the {{EvictingBoundedList}} and
> then serve this information via the Web UI.
> _-- Update: January 21, 2021 --_
> The UI can already handle multiple exceptions through the Exception History.
> Right now, we list one or more exceptions which caused the job to fail.
> Instead, we could adapt it in a way that the history contains not only the
> exceptions of the most recent failure but one expandable entry per restart.
> If there is more than one exception connected to a single restart, we would
> list their stacktraces within one expandable entry.
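The idea in the description, keeping only the last {{n}} restarts with one entry per restart that may carry several exceptions, might be sketched as below. This is not Flink's {{EvictingBoundedList}}; {{BoundedRestartHistory}} and its methods are hypothetical names, and stack traces are represented as plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

public class BoundedRestartHistory {
    private final int maxEntries;
    // Oldest restart at the head, most recent at the tail.
    private final Deque<List<String>> entries = new ArrayDeque<>();

    public BoundedRestartHistory(int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
    }

    /** Records one restart together with all stack traces connected to it. */
    public void recordRestart(List<String> stackTraces) {
        if (entries.size() == maxEntries) {
            entries.removeFirst(); // evict the oldest restart
        }
        entries.addLast(Collections.unmodifiableList(new ArrayList<>(stackTraces)));
    }

    /** Returns a copy of the retained entries, oldest first. */
    public List<List<String>> entries() {
        return new ArrayList<>(entries);
    }

    public static void main(String[] args) {
        BoundedRestartHistory history = new BoundedRestartHistory(2);
        history.recordRestart(Arrays.asList("trace A"));
        history.recordRestart(Arrays.asList("trace B", "trace C"));
        history.recordRestart(Arrays.asList("trace D"));
        // Only the two most recent restarts remain.
        System.out.println(history.entries());
    }
}
```

Each inner list corresponds to one expandable entry in the UI; the outer bound keeps memory use fixed regardless of how often the job restarts.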