Map reduce jobs started failing suddenly.....

Pratyush Banerjee Thu, 17 Jul 2008 00:45:28 -0700

Hi All,

We have been using hadoop-0.17.1 in a 50 machine cluster. The configuration of the cluster is as follows:


m/c 1: hadoop-d01 : NameNode/Secondary NameNode/ Jobtracker
rest: hadoop-d02 to hadoop-d50 : Datanodes and tasktrackers

We have been running map reduce jobs on the above-mentioned cluster successfully for some time, but suddenly all map reduce jobs have started failing. After our own custom map reduce jobs failed, we even tried executing the wordcount example that ships with the hadoop examples. Even that failed. The logs don't seem to reveal much. Here are the excerpts from the logs that appeared relevant to me.

NAMENODE: JOBTRACKER LOG

2008-07-17 00:30:00,856 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000044
2008-07-17 00:30:00,856 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000044_0' to tip tip_200807090133_0048_m_000044, for tracker 'tracker_hadoop-d34.search.aol.com:localhost.localdomain/127.0.0.1:2375'
2008-07-17 00:30:00,877 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000045
2008-07-17 00:30:00,877 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000045_0' to tip tip_200807090133_0048_m_000045, for tracker 'tracker_hadoop-d33.search.aol.com:localhost.localdomain/127.0.0.1:3484'
2008-07-17 00:30:00,890 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000046
2008-07-17 00:30:00,890 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000046_0' to tip tip_200807090133_0048_m_000046, for tracker 'tracker_hadoop-d10.search.aol.com:localhost.localdomain/127.0.0.1:7236'
2008-07-17 00:30:00,903 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000047
2008-07-17 00:30:00,904 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000047_0' to tip tip_200807090133_0048_m_000047, for tracker 'tracker_hadoop-d11.search.aol.com:localhost.localdomain/127.0.0.1:64400'
2008-07-17 00:30:00,909 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task tip_200807090133_0048_m_000048
2008-07-17 00:30:00,909 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000048_0' to tip tip_200807090133_0048_m_000048, for tracker 'tracker_hadoop-d12.search.aol.com:localhost.localdomain/127.0.0.1:6953'
2008-07-17 00:30:00,923 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000049
2008-07-17 00:30:00,923 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000049_0' to tip tip_200807090133_0048_m_000049, for tracker 'tracker_hadoop-d20.search.aol.com:localhost.localdomain/127.0.0.1:31257'
2008-07-17 00:30:00,934 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000000_0: java.io.IOException: Task process exit with nonzero status of 1.
       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
2008-07-17 00:30:00,935 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_r_000000_0' to tip tip_200807090133_0048_r_000000, for tracker 'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,232 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000003_0: java.io.IOException: Task process exit with nonzero status of 1.
       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,475 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_r_000000_0: java.io.IOException: Task process exit with nonzero status of 1.

       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,476 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200807090133_0048_m_000000_0' from 'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,516 INFO org.apache.hadoop.mapred.JobInProgress: Choosing rack-local task tip_200807090133_0048_m_000000
2008-07-17 00:30:01,516 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_m_000000_1' to tip tip_200807090133_0048_m_000000, for tracker 'tracker_hadoop-d42.search.aol.com:localhost.localdomain/127.0.0.1:1914'
2008-07-17 00:30:01,619 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000008_0: java.io.IOException: Task process exit with nonzero status of 1.

       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000001_0: java.io.IOException: Task process exit with nonzero status of 1.

       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)

2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000002_0: java.io.IOException: Task process exit with nonzero status of 1.

       at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
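
To see whether failures like the ones above cluster on particular machines, the JobTracker log can be summarized per tasktracker host. The following is only a rough sketch: the helper name `failures_per_host` is hypothetical, and the regular expressions assume the message wording seen in this excerpt, with one log entry per line.

```python
import re
from collections import Counter

# Patterns assume the JobTracker message wording seen in the excerpt
# (hadoop-0.17.1); other versions may phrase these lines differently.
ASSIGN_RE = re.compile(
    r"Adding task '(task_\w+)' to tip \w+, for tracker 'tracker_([^:']+):"
)
ERROR_RE = re.compile(r"Error from (task_\w+): java\.io\.IOException")


def failures_per_host(log_text):
    """Count failed task attempts per tasktracker host (hypothetical helper)."""
    host_of = {}          # task attempt id -> host it was assigned to
    failures = Counter()  # host (or "unknown") -> number of failed attempts
    for line in log_text.splitlines():
        m = ASSIGN_RE.search(line)
        if m:
            host_of[m.group(1)] = m.group(2)
            continue
        m = ERROR_RE.search(line)
        if m:
            # Attempts assigned before the excerpt starts have no known host.
            failures[host_of.get(m.group(1), "unknown")] += 1
    return failures


# Three entries taken from the excerpt above.
sample = """\
2008-07-17 00:30:00,935 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200807090133_0048_r_000000_0' to tip tip_200807090133_0048_r_000000, for tracker 'tracker_hadoop-d17.search.aol.com:localhost.localdomain/127.0.0.1:6179'
2008-07-17 00:30:01,475 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_r_000000_0: java.io.IOException: Task process exit with nonzero status of 1.
2008-07-17 00:30:01,673 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807090133_0048_m_000001_0: java.io.IOException: Task process exit with nonzero status of 1.
"""

for host, count in failures_per_host(sample).most_common():
    print(host, count)
```

Run against the full log file rather than this short sample, a summary like this would show whether the nonzero exit status is spread evenly across the cluster or concentrated on a few hosts.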

When we examined the failed jobs from the web UI, every failed job listed about 13 blacklisted nodes. The nodes that were NOT blacklisted by any job, compiled from the different failed jobs, are:

hadoop-d01, hadoop-d10
hadoop-d12, hadoop-d15, hadoop-d16, hadoop-d18, hadoop-d19
hadoop-d22, hadoop-d23, hadoop-d24, hadoop-d27, hadoop-d28
hadoop-d33
hadoop-d41, hadoop-d45
hadoop-d50
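
For reference, the complement of this list (the nodes blacklisted by at least one failing job) can be derived with a few lines of Python. This is a sketch that assumes the hostnames run exactly from hadoop-d01 to hadoop-d50, as in the cluster description above.

```python
# Hostnames are assumed to follow the hadoop-dNN pattern, d01 through d50.
all_nodes = {f"hadoop-d{n:02d}" for n in range(1, 51)}

# Nodes that no failed job blacklisted, copied from the list above.
not_blacklisted = {
    "hadoop-d01", "hadoop-d10",
    "hadoop-d12", "hadoop-d15", "hadoop-d16", "hadoop-d18", "hadoop-d19",
    "hadoop-d22", "hadoop-d23", "hadoop-d24", "hadoop-d27", "hadoop-d28",
    "hadoop-d33",
    "hadoop-d41", "hadoop-d45",
    "hadoop-d50",
}

# Everything else was blacklisted by at least one failing job.
blacklisted_somewhere = sorted(all_nodes - not_blacklisted)
print(len(blacklisted_somewhere), "nodes:", ", ".join(blacklisted_somewhere))
```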

For all the other nodes, which have been blacklisted by one or another of the failing jobs, no entries are present in the logs.

For the remaining live machines, the following are excerpts from the logs.

Hadoop-d10-Tasktracker log

2008-07-17 01:22:12,044 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(task_200807090133_0053_m_000085_0,8) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200807090133_0053/task_200807090133_0053_m_000085_0/output/file.out.index in any of the configured local directories
   at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
   at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
   at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2300)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
   at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
   at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
   at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
   at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
   at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
   at org.mortbay.http.HttpServer.service(HttpServer.java:954)
   at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
   at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
   at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
   at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
   at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Incidentally, we are writing log files into the hadoop-dfs continuously from some of our web servers. We added a couple of new web servers recently (2 days back, to be precise), which causes much more data to be dumped into HDFS. Strangely, the map reduce jobs started failing right after that (might be a coincidence, might not be).


Can anybody suggest why this is happening, or any remedy for it?

thanks and regards

Pratyush Banerjee
