[
https://issues.apache.org/jira/browse/FLINK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855562#comment-16855562
]
vinoyang commented on FLINK-12683:
----------------------------------
Hi [~till.rohrmann], I think a good log message should describe the event
clearly and provide enough information to locate the problem. In most cases,
when a checkpoint acknowledgement times out, we need to find the location of
the task manager involved, but these messages do not contain that
information. Regarding the other logs that do provide the task manager
location, I still hold the view I gave in my reply to [~klion26]:
* Relying on the surrounding log messages for context is not an effective way
to locate the problem: when the job's parallelism is very large, there are a
great many executions;
* Log files are not only viewed directly but also collected into a search
engine; when we query by log pattern, it is hard to relate a matching message
to its surrounding context messages;
* We have an alerting system that triggers alerts based on key log messages,
for example by sending them (checkpoint timeout, job failure) via SMS or IM
tools. If these messages carry more detailed information, we can locate
problems quickly; otherwise we have to go back to the log files.
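To make the last point concrete, here is a minimal sketch of a self-contained log line that carries the location with it. The class {{LateAckLogMessage}}, its nested {{Location}} holder, the message wording after "of job", and the host, port and resource ID values are all hypothetical and only for illustration; this is not the actual Flink code or the proposed patch.

```java
public final class LateAckLogMessage {

    /** Hypothetical stand-in for a task manager's location; not Flink's own class. */
    public static final class Location {
        private final String resourceId;
        private final String host;
        private final int dataPort;

        public Location(String resourceId, String host, int dataPort) {
            this.resourceId = resourceId;
            this.host = host;
            this.dataPort = dataPort;
        }

        @Override
        public String toString() {
            // Render everything an operator needs to find the process.
            return resourceId + " @ " + host + ":" + dataPort;
        }
    }

    /**
     * Builds a late-acknowledgement message that can be understood on its own,
     * without looking up neighbouring log lines. The "at <location>" suffix is
     * an assumed format, not one taken from Flink.
     */
    public static String format(
            long checkpointId, String executionAttemptId, String jobId, Location location) {
        return String.format(
                "Received late message for now expired checkpoint attempt %d "
                        + "from task %s of job %s at %s.",
                checkpointId, executionAttemptId, jobId, location);
    }

    public static void main(String[] args) {
        // Made-up location values; the IDs are the ones from the issue description.
        Location location = new Location("container_01", "worker-17.example.com", 6122);
        System.out.println(format(
                6035L,
                "ccd88d08bf82245f3466c9480fb5687a",
                "775ef8ff0159b071da7804925bbd362f",
                location));
    }
}
```

With a line like this, an alert forwarded by SMS or IM, or a single hit in a log search engine, already names the host to inspect.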
> Provide task manager's location information for checkpoint coordinator
> specific log messages
> --------------------------------------------------------------------------------------------
>
> Key: FLINK-12683
> URL: https://issues.apache.org/jira/browse/FLINK-12683
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: vinoyang
> Assignee: vinoyang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Currently, the {{AcknowledgeCheckpoint}} message does not contain the task manager's
> location information. When a task's snapshot task sends an ack message to the
> coordinator, we can only log this message:
> {code:java}
> Received late message for now expired checkpoint attempt 6035 from
> ccd88d08bf82245f3466c9480fb5687a of job 775ef8ff0159b071da7804925bbd362f.
> {code}
> Sometimes we need this subtask's location information to do further
> debugging, e.g. to dump a stack trace. Without the location information,
> the message does not help us locate the problem quickly.
>