[
https://issues.apache.org/jira/browse/FLINK-29940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635678#comment-17635678
]
Mingliang Liu commented on FLINK-29940:
---------------------------------------
Thanks for the comment [~gaoyunhaii]. We have been using a comprehensive
dashboard to show metrics including {{numberOfRestart}}. However, when we see
multiple restarts within an interval, it's not clear which subtask / taskmanager
caused the failure. This log is one of the most relevant places to check the
exception stack of the root cause. With so much surrounding INFO level logging,
it's not straightforward to find this exact line. With ERROR level logging, it's
much simpler to spot by eye as well as to set up alerts on. I think increasingly,
deployments are per-job-per-JM (e.g. application mode). The job failure is an
"error" event to the dedicated JM. If the JM cannot recover from such an error,
wouldn't it be FATAL?
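
For illustration, a minimal sketch of the kind of change being discussed (not the
actual {{ExecutionGraph}} code; the class, method, and message format below are
simplified assumptions): keep state transitions at INFO, but log at ERROR when the
target state is FAILED, so the failure and its exception stack stand out and can
drive log-based alerting.

{code:java}
// Sketch only: pick the log level for a job state transition based on the
// target state, so that a failure is easy to grep and to alert on.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JobStateLoggingSketch {

    private static final Logger LOG = LoggerFactory.getLogger(JobStateLoggingSketch.class);

    // Simplified stand-in for Flink's JobStatus.
    enum JobStatus { CREATED, RUNNING, RESTARTING, FAILED, FINISHED }

    static void logStateTransition(
            String jobName, String jobId, JobStatus current, JobStatus newState, Throwable error) {
        if (newState == JobStatus.FAILED) {
            // ERROR level: stands out from surrounding INFO lines and can feed alerts.
            LOG.error(
                    "Job {} ({}) switched from state {} to {}.",
                    jobName, jobId, current, newState, error);
        } else {
            // All other transitions stay at INFO, as today.
            LOG.info(
                    "Job {} ({}) switched from state {} to {}.",
                    jobName, jobId, current, newState, error);
        }
    }

    public static void main(String[] args) {
        logStateTransition(
                "wordcount", "a1b2c3", JobStatus.RUNNING, JobStatus.FAILED,
                new RuntimeException("TaskManager lost"));
    }
}
{code}

With something along these lines, an alert on ERROR-level JM logs would fire exactly
on the failing transition, without having to parse the surrounding INFO output.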
> ExecutionGraph logs job state change at ERROR level when job fails
> ------------------------------------------------------------------
>
> Key: FLINK-29940
> URL: https://issues.apache.org/jira/browse/FLINK-29940
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.16.0
> Reporter: Mingliang Liu
> Priority: Minor
> Labels: pull-request-available
>
> When the job switches to FAILED state, the log is very useful for understanding
> why it failed, along with the root cause exception stack. However, the current
> log level is INFO, which is inconvenient for users to search for among so many
> surrounding log lines. We can log at ERROR level when the job switches to
> FAILED state.