I set up a MySQL database to catch Hadoop log events (at the WARN level and above). I created a database named HadoopLog with a table like:

mysql> show columns from Events;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| category | varchar(80)  | YES  |     | NULL    |       |
| t        | datetime     | YES  |     | NULL    |       |
| priority | varchar(12)  | YES  |     | NULL    |       |
| host     | varchar(100) | YES  |     | NULL    |       |
| message  | varchar(255) | YES  |     | NULL    |       |
+----------+--------------+------+-----+---------+-------+

I also set max_connections=4000 on the MySQL server.
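
For reference, a CREATE TABLE statement matching the columns above looks like this (if the table gets big, an index on category and priority would probably help the summary queries below, but I haven't measured that):

create table Events (
  category varchar(80),
  t        datetime,
  priority varchar(12),
  host     varchar(100),
  message  varchar(255)
);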

You'll also need to download the following GPL'ed software.

jdbcappender.jar from http://www.dankomannhaupt.de/projects/index.html
mysql-connector-java-3.0.17-ga-bin.jar from http://www.mysql.com/products/connector/j/

and put them into your hadoop/lib directory. Note that this JDBCAppender is not the one built into Log4j, which has some really bad limitations, such as not escaping single quotes.

Then change the root logger to include MySql and define MySql as a new appender in log4j.properties like this (after changing the hostname, database name, login, and password to match your setup):

# Have all loggers in both servers and clients send to MySql as well as the default.
log4j.rootLogger=${hadoop.root.logger},MySql

# Define a MySql appender that sends events to a mysql database
log4j.appender.MySql=org.apache.log4j.jdbcplus.JDBCAppender
log4j.appender.MySql.dbclass=com.mysql.jdbc.Driver
log4j.appender.MySql.threshold=WARN
log4j.appender.MySql.url=jdbc:mysql://mysqlhost.foobar.com/HadoopLog
log4j.appender.MySql.username=hadoop
log4j.appender.MySql.password=whatever
log4j.appender.MySql.sql=insert into Events (category, t, priority, host, message) values ('@CAT@', current_timestamp(), '@PRIO@', substring_index(user(),'@',-1), '@MSG@')
log4j.appender.MySql.layout=org.apache.log4j.PatternLayout
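
Once that's in place and the daemons have been restarted to pick up the new log4j.properties, a quick sanity check from the mysql client is something like:

select priority, host, message from Events order by t desc limit 10;

which should start returning rows as soon as anything logs at WARN or above.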

The advantage of having the events in a single database is that you can run queries to summarize them. For example, running the sort example on 4 TB of data across 409 nodes, I ended up with the following distribution of events:

mysql> select priority, category, count(*) from Events group by priority,category;
+----------+------------------------------------------+----------+
| priority | category                                 | count(*) |
+----------+------------------------------------------+----------+
| WARN     | org.apache.hadoop.dfs.DataNode           |        4 |
| WARN     | org.apache.hadoop.ipc.Server             |       35 |
| WARN     | org.apache.hadoop.mapred.JobTracker      |       30 |
| WARN     | org.apache.hadoop.mapred.TaskRunner      |     8256 |
| WARN     | org.apache.hadoop.mapred.TaskTracker     |        3 |
| WARN     | org.mortbay.http.HttpContext             |        1 |
| WARN     | org.mortbay.http.SocketListener          |   971317 |
| WARN     | org.mortbay.jetty.servlet.ServletHandler |      409 |
| WARN     | org.mortbay.util.ThreadPool              |     2659 |
+----------+------------------------------------------+----------+
9 rows in set (3.83 sec)

If you look at the details, the ThreadPool and SocketListener messages are about Jetty running out of threads for serving the map outputs, the 35 ipc.Server messages are rpc calls being dropped, and so on. It lets me view the errors across the entire cluster. If I were going to put this in production, I'd probably leave the filter at WARN for org.apache and ERROR for org.mortbay. That would drastically cut down the number of events and still catch the most interesting ones.
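
To get a feel for how much that filter would keep, you can run the same summary restricted to those levels (this just mimics the proposed thresholds in SQL; adjust the patterns if your categories differ):

select priority, category, count(*)
from Events
where category like 'org.apache.%'
   or (category like 'org.mortbay.%' and priority in ('ERROR','FATAL'))
group by priority, category;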

The advantage of the SQL database is primarily in doing ad hoc queries. For example, I wanted to see how often each host was put in the penalty box (for failing to transfer map outputs):

mysql> select src,count(*) as cnt from (select substring_index(substring_index(message," ",4)," ",-1) as src from Events where category like '%.TaskRunner' and message like '%penalty box%') as subslt group by src order by cnt;
+----------+------+
| src      | cnt  |
+----------+------+
| node2227 |    1 |
| node2102 |    1 |
| node2084 |    1 |
| node2194 |    1 |
| node2124 |    1 |
| node2185 |    1 |
| node2400 |    1 |
| node2103 |    1 |
| node2318 |    1 |
| node2385 |    4 |
| node2296 |    5 |
| node2222 |    8 |
| node2408 |   31 |
| node2028 |  677 |
| node2003 | 1294 |
+----------+------+
15 rows in set (1.30 sec)

This shows that things went badly on a couple of nodes, but in general it was good. Looking at the distribution the other way didn't prove very useful, because all 409 nodes put some node in the penalty box and the distribution only went from 1 to 11 events per node.
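
For completeness, that other view is just the same events grouped by the reporting host rather than the node named in the message, something like:

select substring_index(host,'.',1) as reporter, count(*) as cnt
from Events
where category like '%.TaskRunner' and message like '%penalty box%'
group by reporter
order by cnt;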

I also looked at the rpc timeouts to see which messages and hosts were having problems:

mysql> select substring_index(host,".",1) as host,substring_index(message,"(",1) as msg,count(*) as cnt from Events where category='org.apache.hadoop.ipc.Server' group by host,msg order by cnt;
+----------+-------------------------+-----+
| host     | msg                     | cnt |
+----------+-------------------------+-----+
| node2194 | Call ping               |   1 |
| node2214 | handler output error    |   1 |
| node2319 | handler output error    |   1 |
| node2013 | handler output error    |   1 |
| node2386 | Call ping               |   1 |
| node2314 | handler output error    |   1 |
| node2091 | handler output error    |   1 |
| node2127 | handler output error    |   1 |
| node2311 | handler output error    |   1 |
| node2226 | Call ping               |   1 |
| node2034 | Call ping               |   1 |
| node2269 | handler output error    |   1 |
| node2003 | Call progress           |   2 |
| node2003 | Call getProtocolVersion |   2 |
| node2108 | Call ping               |   2 |
| node2028 | Call ping               |   5 |
| node2003 | Call ping               |  12 |
+----------+-------------------------+-----+
17 rows in set (1.16 sec)
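
Since node2003 shows up at the top of both lists, a natural follow-up would be to pull all of its events and read through them, something like:

select t, category, message
from Events
where substring_index(host,'.',1) = 'node2003'
order by t;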

Using MySQL to log the important events gave me a convenient way to look at the logs across the entire cluster and a clearer view of what the system was having problems with.

I guess after writing all of this down, I should put it on Hadoop's wiki. *smile*

-- Owen
