(Bcc CDH alias)

> Please don't cross-post, CDH questions should go to their user lists.

Was this CDH specific?
Did the job show up as failed on the jobtracker webui? If yes, can you grep a jobtracker log to see something like

  2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress task_201101040441_333049_r_000004 has failed 4 times.
  2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_201101040441_333049
  2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Killing job 'job_201101040441_333049'

which tells you which task failure caused the job to fail. Then you can look at the userlogs of those task attempts to see why they failed. Ideally this info should show up on the webui.

On the other hand, if the job just hangs for hours, there's probably a bug in the framework.

Koji

On 1/31/11 9:36 PM, "Arun C Murthy" <a...@yahoo-inc.com> wrote:

> Please don't cross-post, CDH questions should go to their user lists.
>
> On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote:
>
>> Hi!
>>
>> I was running a Hadoop cluster on Amazon EC2 instances. After two days of work, one of the worker nodes simply died (I cannot connect to the instance either). That node also appears on the dfshealth page as a dead node. Up to that point everything was normal. Unfortunately, the job it was running didn't survive.
>>
>> The cluster had 8 worker nodes, each with 4 mappers and 2 reducers. The job in question had ~1200 map tasks and 10 reduce tasks. One node died, and I see around 31 failed attempts in the jobtracker log. The log is very similar to the one somebody posted here: http://pastie.org/pastes/1270614
>>
>> Some of the attempts (but not all!) have been retried, and I saw at least two of them which finally reached a successful state. The following two lines appear several times in my jobtracker log:
>>
>>   2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Running list for reducers missing!! Job details are missing.
>>   2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Failed cache for reducers missing!! Job details are missing.
>>
>> This pair of log lines could be the signal that the job couldn't be finished by re-scheduling the failed attempts. I have seen nothing special in the namenode logs.
>>
>> Of course I reran the failed job and it finished successfully. But my problem is that I would like to understand the failover conditions. What could be lost? Which part of Hadoop is not fault tolerant, in the sense that those warnings appear? Is there a chance to control such situations?
>>
>> I am using the CDH3b3 version, so it is a development version of Hadoop. Does somebody know about a particular bug or fix which could solve the problem in the near future?
>>
>> Regards,
>> Tibor Kiss
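
Koji's grep-the-jobtracker-log step can also be done programmatically: the JobTracker will hand back the task completion events of a job, including which attempts failed and on which tracker. Below is a minimal sketch against the old org.apache.hadoop.mapred API (the API this Hadoop generation ships); the job ID is simply the one from the log excerpt above, and the class name FailedAttempts is only illustrative.

    // Sketch: list the failed task attempts of a job by asking the JobTracker
    // instead of grepping its log. Old org.apache.hadoop.mapred API (0.20.x / CDH3).
    // The job ID below is the one from the log excerpt; substitute your own.
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.TaskCompletionEvent;

    public class FailedAttempts {                      // illustrative class name
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();                  // picks up mapred-site.xml from the classpath
        JobClient client = new JobClient(conf);
        RunningJob job = client.getJob(JobID.forName("job_201101040441_333049"));
        if (job == null) {
          System.err.println("JobTracker no longer knows about this job");
          return;
        }
        // Completion events come back in pages, so keep polling from the last offset.
        int from = 0;
        TaskCompletionEvent[] events;
        while ((events = job.getTaskCompletionEvents(from)).length > 0) {
          for (TaskCompletionEvent e : events) {
            if (e.getTaskStatus() == TaskCompletionEvent.Status.FAILED) {
              // The tracker HTTP address tells you which node's userlogs to read.
              System.out.println(e.getTaskAttemptId() + " failed on " + e.getTaskTrackerHttp());
            }
          }
          from += events.length;
        }
      }
    }

The printed attempt IDs are the ones whose userlogs (and stack traces) explain why the job was aborted, which is the same information the webui shows when it is behaving.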
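
As for controlling such situations: the job-level knobs that decide how many attempts a task gets before the whole job is failed are mapred.map.max.attempts, mapred.reduce.max.attempts and mapred.max.tracker.failures, all of which default to 4 (hence "has failed 4 times" in the log excerpt). A hedged sketch of setting them through the old JobConf API, with the defaults written out explicitly rather than as a recommendation:

    // Sketch: retry-related settings on the old JobConf API (0.20.x / CDH3 era).
    // Values shown are the defaults; raising them only buys more retries per task.
    import org.apache.hadoop.mapred.JobConf;

    public class RetryKnobs {                      // illustrative class name
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setMaxMapAttempts(4);             // mapred.map.max.attempts
        conf.setMaxReduceAttempts(4);          // mapred.reduce.max.attempts
        conf.setMaxTaskFailuresPerTracker(4);  // mapred.max.tracker.failures: failures on one
                                               // node before the job stops scheduling there
        System.out.println("map attempts:    " + conf.getMaxMapAttempts());
        System.out.println("reduce attempts: " + conf.getMaxReduceAttempts());
      }
    }

Note that these settings only govern how often a failed attempt is retried; map output that lived on the dead node's local disk still has to be regenerated by re-running those maps, which is what the re-scheduled attempts in the jobtracker log are doing.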