logging improvements for Hadoop
-------------------------------

         Key: HADOOP-211
         URL: http://issues.apache.org/jira/browse/HADOOP-211
     Project: Hadoop
        Type: Improvement

    Versions: 0.2    
    Reporter: Sameer Paranjpye
 Assigned to: Sameer Paranjpye 
    Priority: Minor
     Fix For: 0.3


Here's a proposal for some improvements to the way Hadoop does logging. It
advocates three broad changes to the way logging is currently done:

- The use of a uniform logging format by all Hadoop subsystems
- The use of Apache commons logging as a facade above an underlying logging 
framework
- The use of Log4J as the underlying logging framework instead of 
java.util.logging

This is largely polishing work, but it should make log analysis and debugging
easier in the short term. In the long term, it would future-proof logging by
allowing the underlying logging framework to change with minimal code change.
The proposed changes are motivated by the following requirements, which we
think Hadoop's logging should meet:

- Hadoop's logs should be amenable to analysis by tools like grep, sed, awk etc.
- Log entries should be clearly annotated with a timestamp and a logging level
- Log entries should be traceable to the subsystem from which they originated
- The logging implementation should allow log entries to be annotated with 
source code 
location information like classname, methodname, file and line number, without 
requiring
code changes
- It should be possible to change the logging implementation used without 
having to change
thousands of lines of code
- The mapping of loggers to destinations (files, directories, servers etc.) 
should be 
specified and modifiable via configuration


Uniform logging format:

All Hadoop logs should have the following structure.

<Header>\n
<LogEntry>\n [<Exception>\n]
.
.
.

where the header line specifies the format of each log entry. The header line 
has the format:
'# <Fieldname> <Fieldname>...\n'. 

The default header line is '# Timestamp Level LoggerName Message', so each log
entry has the form 'Timestamp Level LoggerName Message', where:

- Timestamp is a date and time in the format MM/DD/YYYY:HH:MM:SS
- Level is the logging level (FATAL, WARN, DEBUG, TRACE, etc.)
- LoggerName is the short name of the logging subsystem from which the message 
originated e.g.
fs.FSNamesystem, dfs.Datanode etc.
- Message is the log message produced
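A minimal sketch of the layout described above, using only the JDK. The class
name UniformLogFormat, its methods, and the sample message are hypothetical and
not part of Hadoop; the timestamp pattern assumes that the second MM in
MM/DD/YYYY:HH:MM:SS means minutes.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Hadoop: renders and splits lines in the
// proposed "Timestamp Level LoggerName Message" layout.
final class UniformLogFormat {
    // Header line naming the fields of every entry that follows it.
    static final String HEADER = "# Timestamp Level LoggerName Message";

    // Formats one entry. Lower-case "ss" is seconds in SimpleDateFormat;
    // upper-case "SS" would print milliseconds instead.
    static String entry(Date when, String level, String loggerName,
                        String message, TimeZone zone) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("MM/dd/yyyy:HH:mm:ss", Locale.ROOT);
        fmt.setTimeZone(zone);
        return fmt.format(when) + " " + level + " " + loggerName + " " + message;
    }

    // Splits an entry back into its four fields. The limit of 4 keeps any
    // spaces inside the message intact.
    static String[] fields(String line) {
        return line.split(" ", 4);
    }

    public static void main(String[] args) {
        System.out.println(HEADER);
        System.out.println(entry(new Date(), "WARN", "dfs.Datanode",
                                 "Lost heartbeat", TimeZone.getDefault()));
    }
}
```

Because the first three fields contain no spaces, tools like awk can select on
level or logger name by column position.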


Why Apache commons logging and Log4J?

Apache commons logging is a facade meant to be used as a wrapper around an 
underlying logging
implementation. Bridges from Apache commons logging to popular logging 
implementations 
(Java logging, Log4J, Avalon etc.) are implemented and available as part of the 
commons logging
distribution. Implementing a bridge to an unsupported implementation is fairly
straightforward: it involves subclassing the commons logging LogFactory class
and implementing its Log interface. Using Apache commons logging and making all
logging calls through it enables us, in the best case, to move to a different
logging implementation simply by changing configuration. Even in the worst
case, the code churn is minimal.
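The facade idea can be sketched with the JDK alone. Everything below is a
hypothetical stand-in written for illustration: SimpleLog, JulBridge,
MemoryBridge, SimpleLogFactory and the sketch.log.backend property are not the
commons logging API. The point is only that callers depend on the facade while
the factory decides which back end is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical facade interface; application code only ever sees this type.
interface SimpleLog {
    void info(String message);
    void warn(String message);
}

// Bridge from the facade to java.util.logging.
final class JulBridge implements SimpleLog {
    private final Logger delegate;
    JulBridge(String name) { delegate = Logger.getLogger(name); }
    public void info(String message) { delegate.log(Level.INFO, message); }
    public void warn(String message) { delegate.log(Level.WARNING, message); }
}

// Bridge that keeps entries in memory, standing in for a second back end.
final class MemoryBridge implements SimpleLog {
    final List<String> lines = new ArrayList<>();
    private final String name;
    MemoryBridge(String name) { this.name = name; }
    public void info(String message) { lines.add("INFO " + name + " " + message); }
    public void warn(String message) { lines.add("WARN " + name + " " + message); }
}

// The factory is the only place that knows which back end is active.
// The property name is an assumption made for this sketch.
final class SimpleLogFactory {
    static SimpleLog getLog(String name) {
        String backend = System.getProperty("sketch.log.backend", "jul");
        return backend.equals("memory") ? new MemoryBridge(name)
                                        : new JulBridge(name);
    }

    public static void main(String[] args) {
        // Calling code is identical whichever back end is configured.
        SimpleLog log = SimpleLogFactory.getLog("dfs.Datanode");
        log.info("started");
    }
}
```

Switching back ends here means changing one system property; no call site
that uses SimpleLog has to change.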

Log4J offers a few benefits over java.util.logging that make it a more 
desirable choice for the
logging back end.

- Configuration Flexibility: The mapping of loggers to destinations (files, 
sockets etc.)
can be completely specified in configuration. This is possible with Java
logging as well, but its configuration is a lot more restrictive. For instance,
with Java
logging all 
log files must have names derived from the same pattern. For the namenode, log 
files could 
be named with the pattern "%h/namenode%u.log" which would put log files in the 
user.home
directory with names like namenode0.log etc. With Log4J it would be possible to 
configure
the namenode to emit log files with different names, say heartbeats.log, 
namespace.log,
clients.log etc. Configuration variables in Log4J can also have the values of 
system 
properties embedded in them.

- Takes wrappers into account: Log4J takes into account the possibility that an 
application
may be invoking it via a wrapper, such as Apache commons logging. This is 
important because
logging event objects must be able to infer the context of the logging call 
such as classname,
methodname etc. Inferring context is a relatively expensive operation that 
involves creating
an exception and examining the stack trace to find the frame just before the 
first frame 
of the logging framework. It is therefore done lazily only when this 
information actually 
needs to be logged. Log4J can be instructed to look for the frame corresponding
to the wrapper class; Java logging cannot. In the case of Java logging this
means that a) the
bridge from 
Apache commons logging is responsible for inferring the calling context and 
setting it in the 
logging event and b) this inference has to be done on every logging call 
regardless of whether
or not it is needed.

- More handy features: Log4J has some handy features that Java logging doesn't. 
A couple
of examples of these:
a) Date based rolling of log files 
b) Format control through configuration. Log4J has a PatternLayout class that
can be configured to generate logs with a user specified pattern. The logging
format described above can be expressed as
"%d{MM/dd/yyyy:HH:mm:ss} %p %c{2} %m". The format specifiers indicate that each
log line should have the date and time, followed by the logging level or
priority, followed by the logger name, followed by the application generated
message.
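The wrapper-aware caller lookup discussed under "Takes wrappers into account"
can be sketched with the JDK alone. This is an illustration of the general
technique (capture a stack trace, skip the wrapper's frames), not Log4J's
actual code. CallerLocator, WrapperLog and NamenodeDemo are hypothetical names,
and the lookup is done eagerly here for brevity, whereas a real framework
would defer it until the location is actually printed.

```java
// Hypothetical locator: finds the frame that called into the wrapper class.
final class CallerLocator {
    // Walks the current stack and returns the first frame that follows the
    // frames of wrapperClassName, or null if the wrapper is not on the stack.
    static StackTraceElement callerOf(String wrapperClassName) {
        // Capturing the stack trace is the expensive step.
        StackTraceElement[] frames = new Throwable().getStackTrace();
        boolean seenWrapper = false;
        for (StackTraceElement frame : frames) {
            boolean isWrapper = frame.getClassName().equals(wrapperClassName);
            if (seenWrapper && !isWrapper) {
                return frame; // first frame past the wrapper: the real caller
            }
            seenWrapper |= isWrapper;
        }
        return null;
    }
}

// Hypothetical logging wrapper, playing the role of a facade class.
final class WrapperLog {
    static StackTraceElement lastCaller;

    static void info(String message) {
        // Pass the wrapper's own class name so its frame is skipped.
        lastCaller = CallerLocator.callerOf(WrapperLog.class.getName());
        System.out.println(lastCaller.getClassName() + "."
            + lastCaller.getMethodName() + ": " + message);
    }
}

// Hypothetical application code that logs through the wrapper.
final class NamenodeDemo {
    static StackTraceElement heartbeat() {
        WrapperLog.info("heartbeat received");
        return WrapperLog.lastCaller;
    }

    public static void main(String[] args) {
        heartbeat();
    }
}
```

Without the wrapper class name, the lookup would stop at WrapperLog.info and
every entry would appear to originate from the wrapper itself.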

