[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072942#comment-13072942
 ] 

Sudharsan Sampath commented on MAPREDUCE-2635:
----------------------------------------------

Hi Devaraj,

You are right in saying that cyclic jobs will hang indefinitely due to slot
unavailability.
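To illustrate the scenario you describe: a parent that holds its slot while waiting on a child can be sketched with a plain fixed-size thread pool, no Hadoop involved. This is only an analogy for slot accounting, not the actual JobTracker scheduling code, and the class and method names are made up for illustration:

```java
import java.util.concurrent.*;

// Analogy for the slot-starvation deadlock: a "parent" task holds one
// pool slot while blocking on a "child" task that needs another slot.
// With a single slot the child can never start. (Illustrative sketch,
// not JobTracker code.)
public class SlotDeadlockSketch {
  static boolean runNested(int slots) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(slots);
    try {
      Future<?> parent = pool.submit(() -> {
        // Parent occupies a slot, then waits for the child to finish.
        Future<?> child = pool.submit(() -> { /* child work */ });
        try {
          child.get(); // needs a second free slot to ever complete
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      });
      parent.get(500, TimeUnit.MILLISECONDS);
      return true;   // enough slots: nested submission completes
    } catch (TimeoutException e) {
      return false;  // cyclic wait on the only slot: stuck
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With two slots the nested submission completes; with one, the parent occupies the only slot while waiting on the child, which is the cyclic wait Devaraj describes.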

But the issue here is different: it occurs even on clusters with ample slot
capacity. My setup spans two slaves with 20 map slots each.

Another observation: while the initially submitted jobs are hanging, if I
trigger the same set a second time, all four jobs complete immediately
(actually they fail, as I throw an explicit exception).

Hence, my opinion is that we should keep this issue open until we know the
real reason.

Thanks
Sudharsan S




> Jobs hang indefinitely on failure.
> ----------------------------------
>
>                 Key: MAPREDUCE-2635
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2635
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker, task-controller, tasktracker
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Suse Linux cluster with 2 nodes. One running a 
> jobtracker, namenode, datanode, tasktracker. Other running tasktracker, 
> datanode.
>            Reporter: Sudharsan Sampath
>            Priority: Blocker
>
> Running the following example hangs the child job indefinitely.
> public class HaltCluster
> {
>   public static void main(String[] args) throws IOException
>   {
>     JobConf jobConf = new JobConf();
>     prepareConf(jobConf);
>     if (args != null && args.length > 0)
>     {
>       jobConf.set("callonceagain", args[0]);
>       jobConf.setMaxMapAttempts(1);
>       jobConf.setJobName("ParentJob");
>     }
>     JobClient.runJob(jobConf);
>   }
>   public static void prepareConf(JobConf jobConf)
>   {
>     jobConf.setJarByClass(HaltCluster.class);
>     jobConf.set("mapred.job.tracker", "<<jobtracker>>");
>     jobConf.set("fs.default.name", "<<hdfs>>");
>     MultipleInputs.addInputPath(jobConf, new Path("/ignore" + System.currentTimeMillis()), MyInputFormat.class);
>     jobConf.setJobName("ChildJob");
>     jobConf.setMapperClass(MyMapper.class);
>     jobConf.setOutputFormat(NullOutputFormat.class);
>     jobConf.setNumReduceTasks(0);
>   }
> }
> public class MyMapper implements Mapper<IntWritable, Text, NullWritable, NullWritable>
> {
>   JobConf myConf = null;
>   @Override
>   public void map(IntWritable arg0, Text arg1, OutputCollector<NullWritable, NullWritable> arg2, Reporter arg3) throws IOException
>   {
>     if (myConf != null && "true".equals(myConf.get("callonceagain")))
>     {
>       startBackGroundReporting(arg3);
>       HaltCluster.main(new String[] {});
>     }
>     throw new RuntimeException("Throwing exception");
>   }
>   private void startBackGroundReporting(final Reporter arg3)
>   {
>     Thread t = new Thread()
>     {
>       @Override
>       public void run()
>       {
>         while (true)
>         {
>           arg3.setStatus("Reporting to be alive at " + System.currentTimeMillis());
>         }
>       }
>     };
>     t.setDaemon(true);
>     t.start();
>   }
>   @Override
>   public void configure(JobConf arg0)
>   {
>     myConf = arg0;
>   }
>   @Override
>   public void close() throws IOException
>   {
>     // TODO Auto-generated method stub
>   }
> }
> Run using the following command:
> java -cp <<classpath>> HaltCluster true
> But if only one job is triggered, as java -cp <<classpath>> HaltCluster,
> it fails after the max number of map attempts and quits as expected.
> Also, when the jobs hang, running the child job once again makes it come out
> of the deadlock and completes the three jobs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
