[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072844#comment-13072844
 ] 

Devaraj K commented on MAPREDUCE-2635:
--------------------------------------

Harsh, Sudharshan,

    This occurs irrespective of whether the cluster has a single node or
multiple nodes; it happens whenever the cluster's map-slot capacity is less
than the number of map tasks of the parent job.
In the MapReduce application given, the parent job spawns a child job for each
map input and waits for those child jobs to complete. While a parent-job map
task is waiting for its child job to complete, it also keeps updating its
status.

When the cluster's map-slot capacity is less than the number of map tasks of
the parent job, all the slots in the cluster are occupied by the parent job's
mappers. These mappers internally spawn the child jobs; the child jobs wait
for map slots to become free, while the parent job's mappers wait for the
child jobs to complete. Neither condition can ever be satisfied, so everything
waits forever.
Here the problem is with the application logic, and it needs to be corrected
in the application itself. There is no problem in MapReduce with respect to
this.
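
As an illustration of the kind of guard the application could add, here is a
minimal sketch. It assumes the same old org.apache.hadoop.mapred API as the
reproducer below; the ChildJobGuard class and the fail-fast policy are
hypothetical, not something prescribed by this issue. The idea is that a
parent-job mapper asks the JobTracker how many map slots are free before
spawning a child job, and fails fast instead of blocking a slot forever.

import java.io.IOException;

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

/**
 * Hypothetical helper for the application: check whether the cluster still
 * has free map slots before a mapper blocks on JobClient.runJob() for a
 * child job. If every slot is already held by the parent job's mappers,
 * the child job can never start, so the caller should bail out instead of
 * waiting.
 */
public class ChildJobGuard {

  public static boolean hasFreeMapSlots(JobConf conf, int slotsNeeded)
      throws IOException {
    JobClient client = new JobClient(conf);
    ClusterStatus status = client.getClusterStatus();
    // Configured map-slot capacity minus the map tasks currently running.
    int freeSlots = status.getMaxMapTasks() - status.getMapTasks();
    return freeSlots >= slotsNeeded;
  }
}

A parent-job mapper could call ChildJobGuard.hasFreeMapSlots(conf, 1) and
throw, or skip the child job, when it returns false. The more robust fix is
simply not to launch and block on child jobs from inside map tasks at all.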

If the cluster's map-slot capacity is greater than the number of map tasks of
the parent job, the tasks fail and finally the job fails.

This issue can be invalidated.

> Jobs hang indefinitely on failure.
> ----------------------------------
>
>                 Key: MAPREDUCE-2635
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2635
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker, task-controller, tasktracker
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Suse Linux cluster with 2 nodes. One running a 
> jobtracker, namenode, datanode, tasktracker. Other running tasktracker, 
> datanode.
>            Reporter: Sudharsan Sampath
>            Priority: Blocker
>
> Running the following example hangs the child job indefinitely.
> import java.io.IOException;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapred.JobClient;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.lib.MultipleInputs;
> import org.apache.hadoop.mapred.lib.NullOutputFormat;
> public class HaltCluster
> {
>   public static void main(String[] args) throws IOException
>   {
>     JobConf jobConf = new JobConf();
>     prepareConf(jobConf);
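>     // Running with "true" as the first argument configures this run as the
>     // parent job: a single-attempt mapper that spawns the child job itself
>     // (see MyMapper). With no arguments it runs as the plain child job.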
>     if (args != null && args.length > 0)
>     {
>       jobConf.set("callonceagain", args[0]);
>       jobConf.setMaxMapAttempts(1);
>       jobConf.setJobName("ParentJob");
>     }
>     JobClient.runJob(jobConf);
>   }
>   public static void prepareConf(JobConf jobConf)
>   {
>     jobConf.setJarByClass(HaltCluster.class);
>     jobConf.set("mapred.job.tracker", "<<jobtracker>>");
>     jobConf.set("fs.default.name", "<<hdfs>>");
>     MultipleInputs.addInputPath(jobConf, new Path("/ignore" + System.currentTimeMillis()), MyInputFormat.class);
>     jobConf.setJobName("ChildJob");
>     jobConf.setMapperClass(MyMapper.class);
>     jobConf.setOutputFormat(NullOutputFormat.class);
>     jobConf.setNumReduceTasks(0);
>   }
> }
> import java.io.IOException;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
> public class MyMapper implements Mapper<IntWritable, Text, NullWritable, NullWritable>
> {
>   JobConf myConf = null;
>   @Override
>   public void map(IntWritable arg0, Text arg1, OutputCollector<NullWritable, NullWritable> arg2, Reporter arg3) throws IOException
>   {
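>     // For the parent job ("callonceagain" == "true"), report status from a
>     // daemon thread and recursively run the same job as a child job; either
>     // way the map ends by throwing, so the task attempt fails.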
>     if (myConf != null && "true".equals(myConf.get("callonceagain")))
>     {
>       startBackGroundReporting(arg3);
>       HaltCluster.main(new String[] {});
>     }
>     throw new RuntimeException("Throwing exception");
>   }
>   private void startBackGroundReporting(final Reporter arg3)
>   {
>     Thread t = new Thread()
>     {
>       @Override
>       public void run()
>       {
>         while (true)
>         {
>           arg3.setStatus("Reporting to be alive at " + System.currentTimeMillis());
>         }
>       }
>     };
>     t.setDaemon(true);
>     t.start();
>   }
>   @Override
>   public void configure(JobConf arg0)
>   {
>     myConf = arg0;
>   }
>   @Override
>   public void close() throws IOException
>   {
>     // TODO Auto-generated method stub
>   }
> }
> Run it using the following command:
> java -cp <<classpath>> HaltCluster true
> But if only one job is triggered, as in: java -cp <<classpath>> HaltCluster
> then it fails after the maximum number of attempts and quits as expected.
> Also, when the jobs hang, running the child job once again makes them come
> out of the deadlock, and all three jobs complete.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira