(Bcc CDH alias)

> Please don't cross-post, CDH questions should go to their user lists.

Was this CDH specific?
Did the job show up as failed on the jobtracker webui? If yes, can you grep a jobtracker log to see something like

  2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress task_201101040441_333049_r_000004 has failed 4 times.
  2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_201101040441_333049
  2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Killing job 'job_201101040441_333049'

which tells you which task failure caused the job to fail. Then you can look at the userlogs of those task attempts to see why they failed. Ideally this info should show up on the webui.

On the other hand, if the job just hangs for hours, there's probably a bug in the framework.

Koji

On 1/31/11 9:36 PM, "Arun C Murthy" <a...@yahoo-inc.com> wrote:

> Please don't cross-post, CDH questions should go to their user lists.
>
> On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote:
>
>> Hi!
>>
>> I was running a Hadoop cluster on Amazon EC2 instances. After two days of work, one of the worker nodes simply died (I cannot connect to the instance either). That node also appears on the dfshealth page as a dead node. Up to that point everything was normal. Unfortunately, the job it was running didn't survive.
>>
>> The cluster had 8 worker nodes, each with 4 mappers and 2 reducers. The job in question had ~1200 map tasks and 10 reduce tasks. One node died, and I see around 31 failed attempts in the jobtracker log. The log is very similar to the one somebody posted here: http://pastie.org/pastes/1270614
>>
>> Some of the attempts (but not all!) have been retried, and I saw at least two of them which finally reached a successful state. The following two lines appear several times in my jobtracker log:
>>
>>   2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Running list for reducers missing!! Job details are missing.
>>   2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Failed cache for reducers missing!! Job details are missing.
>>
>> This pair of log lines could be the signal that the job couldn't be finished by re-scheduling the failed attempts. I have seen nothing special in the namenode logs.
>>
>> Of course I reran the failed job and it finished successfully. But my problem is that I would like to understand the failover conditions. What could be lost? Which part of Hadoop is not fault tolerant, in the sense that those warnings appear? Is there a chance to control such situations?
>>
>> I am using the CDH3b3 version, so it is a development version of Hadoop. Does somebody know about a particular bug or fix which could solve the problem in the near future?
>>
>> Regards,
>> Tibor Kiss
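
Koji's grep-the-jobtracker-log step can also be done programmatically: the JobTracker will hand back the task completion events of a job, including which attempts failed and on which tracker. Below is a minimal sketch against the old org.apache.hadoop.mapred API (the API this Hadoop generation ships); the job ID is simply the one from the log excerpt above, and the class name FailedAttempts is only illustrative.

    // Sketch: list the failed task attempts of a job by asking the JobTracker
    // instead of grepping its log. Old org.apache.hadoop.mapred API (0.20.x / CDH3).
    // The job ID below is the one from the log excerpt; substitute your own.
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.TaskCompletionEvent;

    public class FailedAttempts {                      // illustrative class name
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();                  // picks up mapred-site.xml from the classpath
        JobClient client = new JobClient(conf);
        RunningJob job = client.getJob(JobID.forName("job_201101040441_333049"));
        if (job == null) {
          System.err.println("JobTracker no longer knows about this job");
          return;
        }
        // Completion events come back in pages, so keep polling from the last offset.
        int from = 0;
        TaskCompletionEvent[] events;
        while ((events = job.getTaskCompletionEvents(from)).length > 0) {
          for (TaskCompletionEvent e : events) {
            if (e.getTaskStatus() == TaskCompletionEvent.Status.FAILED) {
              // The tracker HTTP address tells you which node's userlogs to read.
              System.out.println(e.getTaskAttemptId() + " failed on " + e.getTaskTrackerHttp());
            }
          }
          from += events.length;
        }
      }
    }

The printed attempt IDs are the ones whose userlogs (and stack traces) explain why the job was aborted, which is the same information the webui shows when it is behaving.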
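
As for controlling such situations: the job-level knobs that decide how many attempts a task gets before the whole job is failed are mapred.map.max.attempts, mapred.reduce.max.attempts and mapred.max.tracker.failures, all of which default to 4 (hence "has failed 4 times" in the log excerpt). A hedged sketch of setting them through the old JobConf API, with the defaults written out explicitly rather than as a recommendation:

    // Sketch: retry-related settings on the old JobConf API (0.20.x / CDH3 era).
    // Values shown are the defaults; raising them only buys more retries per task.
    import org.apache.hadoop.mapred.JobConf;

    public class RetryKnobs {                      // illustrative class name
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setMaxMapAttempts(4);             // mapred.map.max.attempts
        conf.setMaxReduceAttempts(4);          // mapred.reduce.max.attempts
        conf.setMaxTaskFailuresPerTracker(4);  // mapred.max.tracker.failures: failures on one
                                               // node before the job stops scheduling there
        System.out.println("map attempts:    " + conf.getMaxMapAttempts());
        System.out.println("reduce attempts: " + conf.getMaxReduceAttempts());
      }
    }

Note that these settings only govern how often a failed attempt is retried; map output that lived on the dead node's local disk still has to be regenerated by re-running those maps, which is what the re-scheduled attempts in the jobtracker log are doing.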