+1 on Richard's comments
+1 on Dick's (MR should symlink the files that belong to the task into the task running directory)
-10 on ../work

On Nov 17, 2006, at 4:25 PM, Richard Kasperski (JIRA) wrote:

[ http://issues.apache.org/jira/browse/HADOOP-673?page=comments#action_12450936 ]

Richard Kasperski commented on HADOOP-673:
------------------------------------------

I think that it is very important that there be a way for an application to run in a sandbox <local>/jobcache/<jobid>/<taskid> that contains the contents of the jar file and is the current working directory. How one accomplishes this is an implementation detail. If the only way to do it is to unjar the archive more than once then I guess that would have to be the solution. This saves the potential for a lot of grief. No shared files are ever modified because they aren't actually shared. This causes more unpacking of jars but I don't really see that as a problem. How the jars are copied to a node and how they are subsequently reused is important.

That is the sandbox. Then there is the shared sandbox, which also allows two different instances to memory map a file and pay a single cost. This is best handled by either softlinks or hardlinks under unix/linux.

Why do I think these are important models? Most of the programs that I write and that I use run out of the current directory and expect that all of their configuration files/resources can be read from there. For programs that have more sophisticated models of deployment the models above are still ok. For the simpler programs the proposed external repository doesn't work.

OTOH I can always run a script that will hard link the files from the directory where the jar was unpacked to my current working directory. I just don't believe that this should be forced on the users. Having the users do system'ish things is potentially dangerous.
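
As a rough illustration of the linking approach described above, here is a minimal Python sketch of a per-task working directory populated with links to files unpacked once per job. It is not Hadoop's implementation: the function name link_into_task_dir, the "jar" subdirectory, and the job and task ids are all hypothetical, and it uses symlinks rather than hard links so that it also works across filesystems.

```python
import os
import tempfile
from pathlib import Path


def link_into_task_dir(unpacked_dir, task_dir):
    """Symlink every top-level entry of unpacked_dir into task_dir (hypothetical helper)."""
    task_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for entry in sorted(unpacked_dir.iterdir()):
        link = task_dir / entry.name
        # Skip names that are already present, including dangling symlinks.
        if not link.exists() and not link.is_symlink():
            # A symlink works across filesystems; os.link() (a hard link)
            # would need both paths on the same filesystem.
            link.symlink_to(entry.resolve())
        links.append(link)
    return links


def demo():
    """Build a throwaway <local>/jobcache/<jobid>/<taskid> layout and read a file via the cwd."""
    with tempfile.TemporaryDirectory() as local:
        job_dir = Path(local) / "jobcache" / "job_0001"      # assumed job id
        unpacked = job_dir / "jar"                           # assumed unpack location
        unpacked.mkdir(parents=True)
        (unpacked / "app.conf").write_text("threads=4\n")

        task_dir = job_dir / "task_0001_m_000000_0"          # assumed task id
        links = link_into_task_dir(unpacked, task_dir)

        previous = os.getcwd()
        os.chdir(task_dir)
        try:
            # The program reads its configuration relative to the current directory.
            content = Path("app.conf").read_text()
        finally:
            os.chdir(previous)
        return content, [p.name for p in links]


if __name__ == "__main__":
    print(demo())
```

With links the jar is unpacked only once per job, but a task that writes through a link modifies the shared copy, which is the risk that unpacking a private copy per task avoids.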

Even more restrictive than the above sandbox would be one in which the application is run chroot'd. That way there is no way it could muck with anything system-like on the nodes. This is an important consideration when one lets arbitrary programs run.

the task execution environment should have a current working directory that is task specific
--------------------------------------------------------------------------------------------

                Key: HADOOP-673
                URL: http://issues.apache.org/jira/browse/HADOOP-673
            Project: Hadoop
         Issue Type: Bug
         Components: mapred
   Affects Versions: 0.7.2
           Reporter: Owen O'Malley
        Assigned To: Mahadev konar
            Fix For: 0.9.0


The tasks should be run in a work directory that is specific to a single task. In particular, I'd suggest using the <local>/jobcache/<jobid>/<taskid> as the current working directory.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira


