[jira] Updated: (HADOOP-4041) IsolationRunner does not work as documented

Tom White (JIRA) Tue, 02 Dec 2008 09:37:14 -0800

     [ https://issues.apache.org/jira/browse/HADOOP-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tom White updated HADOOP-4041:
------------------------------

    Attachment: hadoop-4041.patch

I've been trying out IsolationRunner and I think it is broken in several ways.

1. It doesn't construct a valid classpath, because the directory layout has 
changed, as Owen mentioned.
2. For reduce tasks, it doesn't fill in the missing map outputs (it throws a 
DiskErrorException).
3. For reduce tasks, even if the map outputs are present, the reduce task never 
exits, as Yuri observed.

I've produced a test for IsolationRunner that covers 2. and 3., and a fix for 
1. and 2. (The test currently hangs, which exposes 3.) See the attached patch.

I'm not sure how to fix 3. The task spawns MapOutputCopiers, which go into a 
wait state forever. Should IsolationRunner even be copying map outputs from the 
map nodes, or should it just use the files it has locally? Any ideas on how to 
fix this?

As a future improvement, it would be better if IsolationRunner shared code 
with TaskRunner, so that its behaviour is as faithful as possible to a real 
task run.

> IsolationRunner does not work as documented
> -------------------------------------------
>
>                 Key: HADOOP-4041
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4041
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: documentation, mapred
>    Affects Versions: 0.18.0
>            Reporter: Yuri Pradkin
>         Attachments: hadoop-4041.patch
>
>
> IsolationRunner does not work as documented in the tutorial.
> The tutorial says "To use the IsolationRunner, first set 
> keep.failed.tasks.files to true (also see keep.tasks.files.pattern)."
> It should be:
>   keep.failed.task.files (not tasks)
> After the above was set (quoted from my message on hadoop-core):
> > After the task
> > hung, I failed it via the web interface.  Then I went to the node that was
> > running this task
> >
> >   $ cd ...local/taskTracker/jobcache/job_200808071645_0001/work
> > (this path is already different from the tutorial's)
> >
> >   $ hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
> > Exception in thread "main" java.lang.NullPointerException
> >         at org.apache.hadoop.mapred.IsolationRunner.main(IsolationRunner.java:164)
> >
> > Looking at IsolationRunner code, I see this:
> >
> >     164     File workDirName = new File(lDirAlloc.getLocalPathToRead(
> >     165                                   TaskTracker.getJobCacheSubdir()
> >     166                                   + Path.SEPARATOR + taskId.getJobID()
> >     167                                   + Path.SEPARATOR + taskId
> >     168                                   + Path.SEPARATOR + "work",
> >     169                                   conf). toString());
> >
> > I.e., it assumes there is a task ID subdirectory under the job dir, but:
> >  $ pwd
> >  ...mapred/local/taskTracker/jobcache/job_200808071645_0001
> >  $ ls
> >  jars  job.xml  work
> >
> > -- it's not there.
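
The layout mismatch described above can be illustrated with a small 
standalone sketch. This is not Hadoop code and not the attached patch. It uses 
only java.io.File, and the class name WorkDirLookup, the method findWorkDir, 
and the attempt ID in main are hypothetical names chosen for illustration. It 
assumes the only two candidate layouts are <job dir>/<task id>/work (what 
IsolationRunner looks for) and <job dir>/work (what the listing shows), and it 
returns null instead of failing when neither exists.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class WorkDirLookup {

    /**
     * Looks for a "work" directory first under jobDir/taskId (the layout
     * IsolationRunner assumes), then directly under jobDir (the layout seen
     * in the listing). Returns null if neither directory exists.
     */
    static File findWorkDir(File jobDir, String taskId) {
        File perTask = new File(new File(jobDir, taskId), "work");
        if (perTask.isDirectory()) {
            return perTask;
        }
        File perJob = new File(jobDir, "work");
        if (perJob.isDirectory()) {
            return perJob;
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the observed layout in a temporary directory: jars and work
        // sit directly under the job directory, with no per-task subdirectory.
        File jobDir = Files.createTempDirectory("job_200808071645_0001").toFile();
        new File(jobDir, "jars").mkdir();
        new File(jobDir, "work").mkdir();

        // Hypothetical attempt ID, used only to build the per-task path.
        String taskId = "attempt_200808071645_0001_m_000000_0";

        File found = findWorkDir(jobDir, taskId);
        System.out.println("per-task work dir exists: "
            + new File(new File(jobDir, taskId), "work").isDirectory());
        System.out.println("resolved work dir: " + found);
    }
}
```

Checking for the directory explicitly, rather than dereferencing the result of 
a lookup that may find nothing, would at least turn the NullPointerException 
into a readable error. Whether falling back to the job-level directory is the 
right behaviour for IsolationRunner is a separate question.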
