[ 
https://issues.apache.org/jira/browse/FLINK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855562#comment-16855562
 ] 

vinoyang commented on FLINK-12683:
----------------------------------

Hi [~till.rohrmann], I think a good log message should describe an event 
clearly and provide enough information to locate the problem. In most cases, 
when a checkpoint acknowledgement times out, we need to find the task manager 
location of the task involved, but these messages do not contain this 
information. Regarding the other logs which provide the task manager location, 
I still hold the opinion I gave in my reply to [~klion26]:
 * Relying on the surrounding log messages for context is not an effective way 
to locate the problem. When the job's parallelism is very large, there are a 
great many executions;
 * Log files are not only read directly but also collected into a search 
engine. When we query by log pattern, it is hard to see how a matching message 
relates to the context messages around it;
 * We have an alerting system that triggers alerts based on key log messages, 
for example by sending them (checkpoint timeout, job failure) via SMS or IM 
tools. With more detailed information in the message we can locate problems 
quickly; otherwise we have to go back to the log file.
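
To illustrate the idea, here is a minimal sketch of a late-ack message that 
carries the location itself. It is not the actual checkpoint coordinator code: 
the class {{LateAckLogSketch}}, the method {{lateAckMessage}}, and the sample 
location string are hypothetical names chosen for this example, and the exact 
wording of the final message is an assumption.

```java
// Hypothetical sketch only: not Flink's CheckpointCoordinator implementation.
final class LateAckLogSketch {

    private LateAckLogSketch() {
    }

    /**
     * Builds the "late message" log line and appends the task manager
     * location, so that the single line is enough for alerting and search.
     */
    static String lateAckMessage(
            long checkpointId,
            String executionAttemptId,
            String jobId,
            String taskManagerLocation) {
        return "Received late message for now expired checkpoint attempt "
                + checkpointId
                + " from " + executionAttemptId
                + " of job " + jobId
                + " (task manager: " + taskManagerLocation + ").";
    }

    public static void main(String[] args) {
        // The location value below is made up for illustration.
        System.out.println(lateAckMessage(
                6035L,
                "ccd88d08bf82245f3466c9480fb5687a",
                "775ef8ff0159b071da7804925bbd362f",
                "example-host:6122"));
    }
}
```

With a message of this shape, an alert sent by SMS or IM already names the 
host to inspect, and a log search for the pattern returns the location without 
having to correlate neighbouring lines.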

> Provide task manager's location information for checkpoint coordinator 
> specific log messages
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12683
>                 URL: https://issues.apache.org/jira/browse/FLINK-12683
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, the {{AcknowledgeCheckpoint}} message does not contain the task 
> manager's location information. When a task sends an ack message for its 
> snapshot to the coordinator, we can only log this message:
> {code:java}
> Received late message for now expired checkpoint attempt 6035 from 
> ccd88d08bf82245f3466c9480fb5687a of job 775ef8ff0159b071da7804925bbd362f.
> {code}
> Sometimes we need this subtask's location information for further 
> debugging, e.g. to dump a stack trace. Without the location information, 
> the message does not help us locate the problem quickly.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
