[jira] Issue Comment Edited: (HADOOP-6107) Have some log messages designed for machine parsing, either real-time or post-mortem

Steve Loughran (JIRA) Thu, 25 Jun 2009 06:01:37 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-6107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724076#action_12724076 ]


Steve Loughran edited comment on HADOOP-6107 at 6/25/09 5:59 AM:
-----------------------------------------------------------------

As examples of the problem, here are some client-side logs:

{code}
     [java] 09/06/25 13:41:07 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:41:07 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:41:10 INFO mapred.JobClient: Task Id : attempt_200906251314_0002_r_000001_0, Status : FAILED
     [java] Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
     [java] 09/06/25 13:41:10 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:41:10 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:44:07 INFO mapred.JobClient: Task Id : attempt_200906251314_0002_m_000004_0, Status : FAILED
     [java] Too many fetch-failures
     [java] 09/06/25 13:44:07 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:44:07 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:44:11 INFO mapred.JobClient:  map 83% reduce 0%
     [java] 09/06/25 13:44:14 INFO mapred.JobClient:  map 100% reduce 0%
     [java] 09/06/25 13:49:23 INFO mapred.JobClient: Task Id : attempt_200906251314_0002_m_000005_0, Status : FAILED
     [java] Too many fetch-failures
     [java] 09/06/25 13:49:23 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:49:23 WARN mapred.JobClient: Error reading task outputConnection refused
     [java] 09/06/25 13:49:27 INFO mapred.JobClient:  map 83% reduce 0%
{code}

# Bad spacing in the "Error reading task outputConnection refused" message: 
there is no separator between "output" and "Connection refused".
# Not enough context as to why the connection was being refused: the message 
needs to include the (hostname, port) details, which would change the message 
and break Chukwa.
# No stack trace in the connection refused message.
# Not enough context in the JobClient messages; if more than one job is running 
simultaneously, you can't determine which job the map and reduce percentages 
refer to.
# The shuffle error doesn't actually say what the MAX_FAILED_UNIQUE_FETCHES 
value is.
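
To make the points above concrete, here is a minimal sketch of what a log-scraping tool has to do with these lines. It is written in Python purely for illustration; the names (parse_line, LINE, PROGRESS, TASK) are hypothetical and are not part of Hadoop or Chukwa. The regular expressions are derived only from the sample lines above, and the job ID is reconstructed from the attempt ID on the assumption that attempt_X_Y_... belongs to job_X_Y.

```python
import re

# Prefix of the sample lines: optional Ant "[java]" tag, then
# "yy/MM/dd HH:mm:ss LEVEL logger: message".
LINE = re.compile(
    r"^(?:\[java\] )?(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<logger>[\w.]+): (?P<message>.*)$"
)
# "map 83% reduce 0%" progress messages.
PROGRESS = re.compile(r"^\s*map (?P<map>\d+)% reduce (?P<reduce>\d+)%$")
# "Task Id : attempt_..., Status : FAILED" messages.
TASK = re.compile(
    r"^Task Id : attempt_(?P<jt>\d+)_(?P<job>\d+)_(?P<type>[mr])_\d+_\d+, "
    r"Status : (?P<status>\w+)$"
)


def parse_line(line):
    """Return a dict of fields, or None for lines without the timestamp prefix."""
    m = LINE.match(line.strip())
    if m is None:
        # Continuation lines such as "Too many fetch-failures" end up here.
        return None
    record = m.groupdict()
    message = record["message"]
    progress = PROGRESS.match(message)
    task = TASK.match(message)
    if progress:
        # The progress message carries no job identity at all.
        record.update(kind="progress", map=int(progress["map"]),
                      reduce=int(progress["reduce"]), job=None)
    elif task:
        # Assumption: the job ID can be rebuilt from the attempt ID.
        record.update(kind="task", status=task["status"],
                      job="job_%s_%s" % (task["jt"], task["job"]))
    else:
        record.update(kind="other", job=None)
    return record
```

In this sketch only the "Task Id" lines yield a job, the progress lines yield none, and the failure reason lines cannot be attributed to anything without remembering the previous line. Any rewording of a message to add context would also require the patterns to change.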

  
> Have some log messages designed for machine parsing, either real-time or 
> post-mortem
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6107
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6107
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 0.21.0
>            Reporter: Steve Loughran
>
> Many programs take the log output of bits of Hadoop and try to parse it. 
> Some may also put their own back end behind commons-logging, to capture the 
> input without going via Log4J, so as to keep the output more machine-readable.
> These programs need log messages that
> # are easy to parse with a regexp or other simple string parsing (consider 
> quoting values, etc.)
> # push out the full exception chain rather than stringify() bits of it
> # stay stable across versions
> # log the things the tools need to analyse: events, data volumes, errors
> For these logging tools, ease of parsing, retention of data and stability 
> over time take precedence over readability. In HADOOP-5073, Jiaqi Tan proposed 
> marking some of the existing log events as evolving towards stability. As 
> someone who regularly patches log messages to improve diagnostics, I find that 
> this creates a conflict of interest. For me, good logs are ones that help people 
> debug their problems without anyone else helping, and if that means improving 
> the text, so be it. Tools like Chukwa have a different need. 
> What to do? Some options
>  # Have some messages that are designed purely for other programs to handle
>  # Have some logs specifically for machines, to which we log alongside the 
> human-centric messages
>  # Fix many of the common messages, then leave them alone.
>  # Mark log messages to be left alone (somehow)
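
As an illustration of the second option (a machine-oriented record logged alongside the human-centric message), the sketch below emits one JSON object per event, with quoted values and the full exception chain. It is a Python sketch under stated assumptions, not a proposal for the actual Hadoop API: the function names, the event name, the field names and the host and port values are all hypothetical, and JSON is only one possible stable format.

```python
import json


def exception_chain(exc):
    """Walk the chain of causes so that no nested exception is lost."""
    chain = []
    while exc is not None:
        chain.append({"type": type(exc).__name__, "message": str(exc)})
        exc = exc.__cause__
    return chain


def machine_record(event, error=None, **fields):
    """Build one stable, machine-readable line for an event."""
    record = {"event": event}
    record.update(fields)
    if error is not None:
        record["errors"] = exception_chain(error)
    # Sorted keys keep the output stable from run to run.
    return json.dumps(record, sort_keys=True)


# Hypothetical usage: a read failure caused by a refused connection.
try:
    try:
        raise ConnectionRefusedError("Connection refused")
    except ConnectionRefusedError as cause:
        raise RuntimeError("Error reading task output") from cause
except RuntimeError as failure:
    line = machine_record(
        "task_output_read_failed",       # hypothetical event name
        error=failure,
        job="job_200906251314_0002",
        host="tasktracker.example.org",  # placeholder host
        port=50060,                      # placeholder port
    )
```

With a record like this, the human-readable text can be reworded freely to improve diagnostics while tools read the separate record with json.loads(line), so the two audiences stop competing for the same message.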

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
