Hi All,

We have been using hadoop-0.17.1 on a 50-machine cluster. The cluster is configured as follows:

Machine 1: hadoop-d01: NameNode / Secondary NameNode / JobTracker
Rest: hadoop-d02 to hadoop-d50: DataNodes and TaskTrackers

We have been running MapReduce jobs on this cluster successfully for some time, but suddenly all MapReduce jobs have started failing. After our own custom MapReduce jobs failed, we tried running the wordcount example that ships with the Hadoop examples, and that failed too. The logs don't seem to reveal much. Here are the excerpts that appeared relevant to me.

NAMENODE: JOBTRACKER LOG
2008-07-17 00:30:00,856 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000044
2008-07-17 00:30:00,856 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000044_0' to tip tip_200807090133_0048_m_000044, for tracker 'tracker_hadoop-d34.search.aol.com:localhost.localdomain/127.0.0.1:2375'
2008-07-17 00:30:00,877 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000045
2008-07-17 00:30:00,877 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000045_0' to tip tip_200807090133_0048_m_000045, for tracker 'tracker_hadoop-d33.search.aol.com:localhost.localdomain/127.0.0.1:3484'
2008-07-17 00:30:00,890 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000046
2008-07-17 00:30:00,890 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000046_0' to tip tip_200807090133_0048_m_000046, for tracker 'tracker_hadoop-d10.search.aol.com:localhost.localdomain/127.0.0.1:7236'
2008-07-17 00:30:00,903 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000047
2008-07-17 00:30:00,904 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000047_0' to tip tip_200807090133_0048_m_000047, for tracker 'tracker_hadoop-d11.search.aol.com:localhost.localdomain/127.0.0.1:64400'
2008-07-17 00:30:00,909 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task tip_200807090133_0048_m_000048
2008-07-17 00:30:00,909 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000048_0' to tip tip_200807090133_0048_m_000048, for tracker 'tracker_hadoop-d12.search.aol.com:localhost.localdomain/127.0.0.1:6953'
2008-07-17 00:30:00,923 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000049
2008-07-17 00:30:00,923 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000049_0' to tip tip_200807090133_0048_m_000049, for tracker 'tracker_hadoop-d20.search.aol.com:localhost.localdomain/127.0.0.1:31257'
2008-07-17 00:30:00,934 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000000_0: java.io.IOException: Task process exit with nonzero status of 1.
       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:00,935 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_r_000000_0' to tip tip_200807090133_0048_r_000000, for tracker 'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,232 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000003_0: java.io.IOException: Task process exit with nonzero status of 1.
       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,475 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_r_000000_0: java.io.IOException: Task process exit with nonzero status of 1.
       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,476 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200807090133_0048_m_000000_0' from 'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,516 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000000
2008-07-17 00:30:01,516 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000000_1' to tip tip_200807090133_0048_m_000000, for tracker 'tracker_hadoop-d42.search.aol.com:localhost.localdomain/127.0.0.1:1914'
2008-07-17 00:30:01,619 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000008_0: java.io.IOException: Task process exit with nonzero status of 1.
       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000001_0: java.io.IOException: Task process exit with nonzero status of 1.
       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000002_0: java.io.IOException: Task process exit with nonzero status of 1.
       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
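
In case it helps anyone reading the full log, here is a rough sketch of how the failures could be tallied per tasktracker host. It only assumes the two message shapes visible in the excerpt above ("Adding task ... for tracker ..." and "Error from task_..."); the function name failures_by_host and the regular expressions are my own and not part of Hadoop.

```python
import re
from collections import Counter

# Patterns modeled on the JobTracker lines quoted above (an assumption:
# other Hadoop versions may word these messages differently).
ADD_RE = re.compile(r"Adding task '(task_\w+)' to tip \S+ for tracker 'tracker_([^:']+):")
ERR_RE = re.compile(r"Error from (task_\w+):")

def failures_by_host(log_text):
    """Count 'Error from task_...' entries per host the attempt was assigned to."""
    # Map each task attempt to the host named in its 'Adding task' line.
    attempt_host = {attempt: host for attempt, host in ADD_RE.findall(log_text)}
    # Attempts whose assignment is not in the excerpt are counted as 'unknown'.
    return Counter(attempt_host.get(a, "unknown") for a in ERR_RE.findall(log_text))

excerpt = (
    "2008-07-17 00:30:00,935 INFO org.apache.hadoop.mapred.JobTracker: Adding task "
    "'task_200807090133_0048_r_000000_0' to tip tip_200807090133_0048_r_000000, for tracker "
    "'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'\n"
    "2008-07-17 00:30:01,475 INFO org.apache.hadoop.mapred.TaskInProgress: Error from "
    "task_200807090133_0048_r_000000_0: java.io.IOException: Task process exit with nonzero status of 1.\n"
)
print(failures_by_host(excerpt))
```

Run over the whole JobTracker log, this would show whether the failures cluster on particular hosts or are spread evenly.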

When we examined the failed jobs in the web UI, every failed job listed about 13 blacklisted nodes. The nodes that were NOT blacklisted by any of the failed jobs, compiled across the different failed jobs, are:
hadoop-d01, hadoop-d10
hadoop-d12, hadoop-d15, hadoop-d16, hadoop-d18, hadoop-d19
hadoop-d22, hadoop-d23, hadoop-d24, hadoop-d27, hadoop-d28
hadoop-d33
hadoop-d41, hadoop-d45
hadoop-d50
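
For reference, the complement of this list (the nodes blacklisted by at least one failed job) can be worked out mechanically. This is only a sketch that assumes the hostnames run from hadoop-d01 to hadoop-d50 as in the layout above; the helper name other_nodes is made up for illustration.

```python
# All hosts in the cluster, hadoop-d01 .. hadoop-d50 (assumed from the layout above).
all_nodes = {f"hadoop-d{i:02d}" for i in range(1, 51)}

# Node numbers copied from the non-blacklisted list above.
non_blacklisted = {
    f"hadoop-d{i:02d}"
    for i in (1, 10, 12, 15, 16, 18, 19, 22, 23, 24, 27, 28, 33, 41, 45, 50)
}

def other_nodes(everything, listed):
    """Return the sorted hosts that are in the cluster but not in 'listed'."""
    return sorted(everything - listed)

blacklisted_somewhere = other_nodes(all_nodes, non_blacklisted)
print(len(blacklisted_somewhere), blacklisted_somewhere)
```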

For all the other nodes, which have been blacklisted by one or another of the failing jobs, no entries are present in the logs.

For the remaining live machines, the following are excerpts from the logs.
hadoop-d10 TaskTracker log

2008-07-17 01:22:12,044 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(task_200807090133_0053_m_000085_0,8) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200807090133_0053/task_200807090133_0053_m_000085_0/output/file.out.index in any of the configured local directories
   at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
   at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
   at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2300)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
   at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
   at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
   at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
   at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
   at org.mortbay.http.HttpServer.service(HttpServer.java:954)
   at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
   at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
   at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
   at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
   at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Incidentally, we are continuously writing log files into HDFS from some of our web servers. We added a couple of new web servers recently (2 days ago, to be precise), which causes much more data to be dumped into HDFS. Strangely, the MapReduce jobs started failing right after that (it might be a coincidence, or it might not).

Can anybody suggest why this is happening, or how to remedy it?

Thanks and regards,

Pratyush Banerjee
