[ 
https://issues.apache.org/jira/browse/HADOOP-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Zeyliger updated HADOOP-2914:
------------------------------------

    Attachment: HADOOP-2914-v1-full.patch
                HADOOP-2914-v1-since-4041.patch

I set out to get DistributedCache to work with the local job runner, which 
wasn't too tricky, but I ended up refactoring the DistributedCache code quite 
a bit, which has made this patch large and perhaps unfriendly.

DistributedCache code is used in three places:
# In user code, to (1) configure files to be cached and (2) retrieve the URIs 
of those files at runtime,
# In JobClient, to record some metadata information about the files desired in 
user code,
# And in TaskTracker/TaskRunner, to (1) maintain the cache, and (2) configure 
the cache per task.

Most of the code for all of these uses was in public static methods in 
DistributedCache.java, though some pretty complicated logic about the 
DistributedCache was also in TaskTracker.java and TaskRunner.java.  This made 
it tricky to tease out what the sacrosanct public APIs were.  My interpretation 
is that the methods described in the documentation 
(http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache)
 are public APIs, and I have left those, and a few others, intact.  I 
separated out the other logic into two other classes, so that then I could 
avoid duplication between TaskRunner and LocalJobRunner.

The current patch depends on HADOOP-4041, so I've attached two patches: one for 
Hudson, and another if you don't want to revisit the intersection with 4041 
(which is largely uninteresting: either way code moves out of TaskRunner into 
DistributedCacheHandle).

I've added some tests.  TestDistributedCache has become 
TestDistributedCacheManager, and there's a new test in there.  
TestMRWithDistributedCache tests against both local and MiniMRClusters.  I've 
also tested using streaming, with commands like:
{noformat}
bin/hadoop jar build/contrib/streaming/hadoop-0.21.0-dev-streaming.jar \
  -files /etc/passwd -input /dev/null -output /tmp/output1 \
  -mapper 'sh -c "test ! -z $mapred_cache_localFiles"'
bin/hadoop jar build/contrib/streaming/hadoop-0.21.0-dev-streaming.jar \
  -jt local -files /etc/passwd -input /dev/null -output /tmp/output2 \
  -mapper 'sh -c "test ! -z $mapred_cache_localFiles"'
{noformat}
Is there a place where tests that use streaming to check other functionality 
could be checked in?

I wanted to stop somewhere and send this out, but I can think of several 
potential future JIRAs:

* The DistributedCache is in core/, but it only makes sense with mapred, so it 
probably should be relocated to mapred.
* There's more work to be done to separate out the public interfaces from the 
private ones.  The timestamp handling that's done by JobClient should really be 
done by something within the filecache package, for example.  Much of the 
annoyance here stems from the haphazard ways in which Hadoop jobs serialize 
some configuration data to the configuration file.  DistributedCache uses, I 
believe, six configuration keys just to store its state ("file", "archive", 
"file+classpath", "archive+classpath", "file+timestamp", "archive+timestamp").
* Speaking of configuration, DistributedCache will not likely work for files 
with a comma in their path, though perhaps URI encoding saves us there.
* I haven't touched the DistributedCacheManager code except to move it there, 
but I suspect it could be significantly simplified now that it contains a 
Configuration object.
* It's my belief that SVN r696957 (HADOOP-249) turned off the symlink feature 
and that it hasn't worked since then.  That said, I haven't yet written the 
test that would confirm this.
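
To make the comma concern above concrete, here is a minimal sketch.  It is not 
the actual DistributedCache code: it only assumes that the list of cached files 
ends up as one comma-joined string in the configuration and is later split on 
commas.  The class and method names (CommaPathSketch, join, split, escape, 
unescape) are hypothetical, and the percent-escaping shown is just one way the 
"URI encoding" idea might play out, not something the patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommaPathSketch {
    // Assumption: entries are stored as a single comma-separated value.
    static String join(List<String> entries) {
        return String.join(",", entries);
    }

    // Assumption: the reader recovers entries by splitting on commas.
    static List<String> split(String value) {
        return Arrays.asList(value.split(","));
    }

    // Hypothetical escaping: protect '%' first, then hide commas as %2C.
    static String escape(String path) {
        return path.replace("%", "%25").replace(",", "%2C");
    }

    // Inverse of escape(): restore commas first, then '%'.
    static String unescape(String entry) {
        return entry.replace("%2C", ",").replace("%25", "%");
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/data/a.txt", "/data/b,c.txt");

        // Naive round trip: the comma inside the second path splits it in two.
        List<String> naive = split(join(paths));
        System.out.println("naive entries: " + naive.size());

        // Escaped round trip: each path is escaped before joining.
        List<String> escaped = new ArrayList<>();
        for (String p : paths) {
            escaped.add(escape(p));
        }
        List<String> restored = new ArrayList<>();
        for (String e : split(join(escaped))) {
            restored.add(unescape(e));
        }
        System.out.println("round trip intact: " + restored.equals(paths));
    }
}
```

Note that whether real URIs help depends on how they are built: a comma is a 
legal character in a URI path, so it is not necessarily percent-encoded 
automatically.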

Looking forward to your feedback. -- Philip


> extend DistributedCache to work locally (LocalJobRunner)
> --------------------------------------------------------
>
>                 Key: HADOOP-2914
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2914
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: sam rash
>            Priority: Minor
>         Attachments: HADOOP-2914-v1-full.patch, 
> HADOOP-2914-v1-since-4041.patch
>
>
> The DistributedCache does not work locally when using the outlined recipe at 
> http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html
>  
> Ideally, LocalJobRunner would take care of populating the JobConf and copying 
> remote files to the local file system (http, assume hdfs = default fs = local 
> fs when doing local development).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.