[
https://issues.apache.org/jira/browse/FLINK-29940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17635678#comment-17635678
]
Mingliang Liu commented on FLINK-29940:
---------------------------------------
Thanks for the comment [~gaoyunhaii]. We have been using a comprehensive
dashboard to show metrics including {{numberOfRestart}}. However, when we see
multiple restarts within an interval, it's not clear which subtask / taskmanager
caused the failure. This log is one of the most relevant places to check the
exception stack of the root cause. With so much surrounding INFO level logging,
it's not straightforward to find this exact line. With ERROR level logging, it's
much simpler to spot by eye as well as to set up alerts on. I think increasingly,
deployments are per-job-per-JM (e.g. application mode). The job failure is an
"error" event to the dedicated JM. If the JM cannot recover from such an error,
wouldn't it be FATAL?
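
For illustration, a minimal sketch of the kind of change being discussed (not the
actual {{ExecutionGraph}} code; the class, method, and message format below are
simplified assumptions): keep state transitions at INFO, but log at ERROR when the
target state is FAILED, so the failure and its exception stack stand out and can
drive log-based alerting.

{code:java}
// Sketch only: pick the log level for a job state transition based on the
// target state, so that a failure is easy to grep and to alert on.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JobStateLoggingSketch {

    private static final Logger LOG = LoggerFactory.getLogger(JobStateLoggingSketch.class);

    // Simplified stand-in for Flink's JobStatus.
    enum JobStatus { CREATED, RUNNING, RESTARTING, FAILED, FINISHED }

    static void logStateTransition(
            String jobName, String jobId, JobStatus current, JobStatus newState, Throwable error) {
        if (newState == JobStatus.FAILED) {
            // ERROR level: stands out from surrounding INFO lines and can feed alerts.
            LOG.error(
                    "Job {} ({}) switched from state {} to {}.",
                    jobName, jobId, current, newState, error);
        } else {
            // All other transitions stay at INFO, as today.
            LOG.info(
                    "Job {} ({}) switched from state {} to {}.",
                    jobName, jobId, current, newState, error);
        }
    }

    public static void main(String[] args) {
        logStateTransition(
                "wordcount", "a1b2c3", JobStatus.RUNNING, JobStatus.FAILED,
                new RuntimeException("TaskManager lost"));
    }
}
{code}

With something along these lines, an alert on ERROR-level JM logs would fire exactly
on the failing transition, without having to parse the surrounding INFO output.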
> ExecutionGraph logs job state change at ERROR level when job fails
> ------------------------------------------------------------------
>
> Key: FLINK-29940
> URL: https://issues.apache.org/jira/browse/FLINK-29940
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.16.0
> Reporter: Mingliang Liu
> Priority: Minor
> Labels: pull-request-available
>
> When the job switches to FAILED state, the log is very useful for understanding
> why it failed, along with the root cause exception stack. However, the current
> log level is INFO, which is inconvenient for users to search for among so many
> surrounding log lines. We can log at ERROR level when the job switches to
> FAILED state.