[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737621#action_12737621 ]

Philip Zeyliger commented on MAPREDUCE-476:
-------------------------------------------

Hi Vinod,

Thanks for the ping; I got distracted by other things.  And thanks again for the
detailed review.  My responses are below.  I've generated a patch that shows the
differences between v2 and v4, as well as the full patch, in a state where it
still depends on MAPREDUCE-711.  Is there anything blocking MAPREDUCE-711 from
being committed?

Also, sorry about the multiple uploads here.  I had a very clever bug in there
(caused by not thinking enough while resolving a merge conflict) that deleted
the current working directory, recursively, in one of the tests.  (TaskRunner
is hard-coded to delete the current working directory, which is OK since it
typically runs as a child process; it is not OK for LocalJobRunner.)

I've run the "relevant" tests; the full tests take a while, so I'm running those
in the background.

{quote}
$for i in TestMRWithDistributedCache TestMiniMRLocalFS TestMiniMRDFSCaching 
TestTrackerDistributedCacheManager; do; ant test -Dtestcase=$i > test-out-$i && 
echo "$i good" || echo "$i bad"; done
TestMRWithDistributedCache good
TestMiniMRLocalFS good
TestMiniMRDFSCaching good
TestTrackerDistributedCacheManager good
{quote}

bq. There is quite a bit of refactoring in this patch, though I find it really 
useful.

Yep.  Having DistributedCache work locally is easy if you refactor the code
a bit, so that's how I went at it.


bq. Please make sure that in the newly added code, lines aren't longer than 80 
characters. For e.g, see DistributedCacheManager.newTaskHandle() method.

A handful of runs of git diff foo..bar | egrep "^\+\+\+|^\+ .{80}" has done the
trick, I think.  The tricky bit is always fixing only the lines I've changed,
and not all the lines in a given file, to preserve history and keep reviewing
sane.

bq. Just a thought, can the classes be better renamed to reflect their usage, 
something like TrackerDistributedCacheManager and TaskDistributeCacheManager?

I like those names better; thanks.  Changed.

bq. DistributedCacheManager and DistributedCacheHandle: Explicitly state in 
javadoc that it is not a public interface 

Done.

bq. This class should also have the variable number argument getLocalCache() 
methods so that the corresponding methods in DistributedCache can be 
deprecated. Also, each method in DistributedCache should call the correponding 
method in DistributedCacheManager class.

I don't think I agree here.  We can deprecate the getLocalCache methods in
DistributedCache right away.  They delegate to each other, and one of them
delegates to TrackerDistributedCacheManager.  Ideally, I'd remove these
altogether: Hadoop internally does not use these methods with this patch, and
there's no sensible reason why someone else would, but since they're public,
they're getting deprecated.  They're not being deprecated with a pointer to
use something else; they're being deprecated so that you don't use them at all.
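The deprecate-and-delegate shape described above can be sketched roughly as
follows.  The class and method signatures here are purely illustrative, not the
actual Hadoop API:

```java
// Hypothetical sketch of "deprecate and delegate": the old public static
// entry point stays for one release, marked deprecated with no replacement,
// and simply forwards to the (internal) manager.
public class DistributedCacheSketch {

    /** @deprecated Internal use only; there is no public replacement. */
    @Deprecated
    public static String getLocalCache(String uri) {
        // The deprecated public method forwards to the manager instance.
        return new TrackerManagerSketch().getLocalCache(uri);
    }

    // Stand-in for TrackerDistributedCacheManager; body is illustrative.
    static class TrackerManagerSketch {
        String getLocalCache(String uri) {
            return "/local/cache/" + uri.hashCode();
        }
    }
}
```

The point of the shape is that callers compiling against the old API get a
deprecation warning, while all internal code paths go through the manager.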


bq. DistributedCacheHandle CacheFile.makeCacheFiles()
bq. isClassPath can be renamed to shouldBePutInClasspath

Renamed to shouldBeAddedToClasspath.

bq. paths can be renamed to pathsToPutInClasspath.

Renamed to pathsToBeAddedToClasspath.

bq. Use .equals method at +150 if (cacheFile.type == CacheFile.FileType.ARCHIVE)

I believe that technically it doesn't matter.
The JDK implementation of equals() on java.lang.Enum is final
and hard-coded to "this == other".  That's the only implementation that makes
sense, since there's only ever one instance of a given enum constant.
I also took an (admittedly inaccurate) look at the code base, and == is the
more common option.
{quote}
  # Inaccurate!  Not a static analysis!  Not even close! ;)
  [1]doorstop:hadoop-mapreduce(140142)$ack "\([a-zA-Z]*\.[a-zA-Z]*\.equals" src 
| wc -l
  11
  [0]doorstop:hadoop-mapreduce(140143)$ack "\([a-zA-Z]*\.[a-zA-Z]* ==" src | wc 
-l
  127
{quote}
If you feel strongly about this, happy to change it, but I think == is
more consistent.
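For reference, a minimal stand-alone illustration of the point; the FileType
enum here is just a toy mirroring the one in the patch:

```java
// Demonstrates that for enums, == and equals() always agree:
// java.lang.Enum.equals() is final and implemented as an identity
// comparison, because each enum constant is a singleton.
public class EnumCompare {
    enum FileType { REGULAR, ARCHIVE }

    public static void main(String[] args) {
        FileType a = FileType.ARCHIVE;
        FileType b = FileType.valueOf("ARCHIVE"); // returns the same singleton
        System.out.println(a == b);        // true
        System.out.println(a.equals(b));   // true
        // == is also null-safe, unlike calling equals() on a null reference.
        FileType c = null;
        System.out.println(c == FileType.ARCHIVE); // false, no NPE
    }
}
```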

bq. makeCacheFiles: boolean isArchive -> FileType fileType

Done.

bq. I think it would be cleaner to return target instead of passing it as an 
argument.

Done.

bq. makeCacheFiles() method should be documented

Done.

bq. setup() This method is really useful, avoids a lot of code duplication!

Ok.

bq. Leave localTaskFile writing business back in TaskRunner itself. I think It 
is the task's responsiblity, not the DistributeCacheHandle's

Good call; done.

bq. cacheSubdir can better be an argument to setup() method instead of passing 
it to the constructor.

Good idea; done.

bq. getClassPaths() : Document that it has to be called, and is useful, only 
after setup() is already invoked.

Done.  I've made it throw an exception if it's called erroneously, since I could
see that causing trouble for developers.
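The fail-fast behavior described above looks roughly like this; the class and
field names are hypothetical, not the actual patch code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (hypothetical names): a getter that depends on setup()
// throws if called too early, so misuse fails loudly for developers instead
// of silently returning an empty list.
public class TaskCacheSketch {
    private List<String> classPaths;
    private boolean setupCalled = false;

    public void setup() {
        classPaths = new ArrayList<>();
        classPaths.add("/local/cache/lib.jar"); // stand-in for localized files
        setupCalled = true;
    }

    public List<String> getClassPaths() {
        if (!setupCalled) {
            throw new IllegalStateException(
                "getClassPaths() called before setup()");
        }
        return classPaths;
    }
}
```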

bq. TaskTracker.initialize() A new DistributedCacheManager is created every 
time, so old files will not be deleted by the subsequent purgeCache. Either we 
should have only one cacheManager for a TaskTracker or 
DistributedCacheManager.cachedArchives should be static. The same problem 
exists with the deprecated purgeCache() method in DistributedCache.java

If my understanding is correct, in the course of normal operations,
TaskTracker.initialize() is only called once, by TaskTracker.run().  The
comment there says:
{quote}
  /**
   * The server retry loop.  
   * This while-loop attempts to connect to the JobTracker.  It only 
   * loops when the old TaskTracker has gone bad (its state is
   * stale somehow) and we need to reinitialize everything.
   */
{quote}
At startup time (and whenever the TaskTracker decides to lose all of its state
and reset itself, which is essentially the same thing), the TaskTracker
initializes the TrackerDistributedCacheManager object.  This is also the only
time it's allowed to "clear" the cache, since at initialization time there are
certainly no running tasks that depend on the cached files.

Part of the whole refactoring here is to avoid static members and methods
as much as possible, because they make it quite tricky to test, deprecate,
and generally play nice.

bq.  JobClient.java What happens to the code in here? Should this be refactored 
too/ are we going to do something about it?

Good question :)  I do think that JobClient should be using some methods in the
filecache package, instead of hard-coding a lot of logic.  That said, I chose to
stop somewhere to avoid making this patch even harder to review.  I think it's
ripe for a future JIRA.

bq.  DistributedCache.java getLocalCache() (+190) should be indirected to 
DistributedCacheManager.getLocalCache(). Otherwise there is a stack overflow 
error - the method calls itself as of now.

Oy, good catch.

bq. I think most of the getter/setter methods are deprecable and moveable to 
DistributedCacheManager. Or at least we should give some thought about it. For 
most of them, I do see from the javadoc that they are only for internal use 
anyways.

I agree.  A few of them are used to manage the Configuration object.  (In my
mind, we're serializing and de-serializing a set of requirements for the
distributed cache into the text configuration, and doing so a bit haphazardly.)
I was very tempted to remove all the ones that are only meant to be internal,
but Tom advised me that I need to keep them deprecated for a version.

Again, I think moving those methods into a more private place is a good task
to do along with changing how JobClient calls into this stuff.

bq. DistributedCacheHandle.makeClassLoader use File.toURI.toURL() instead of 
File.toURL() directly. File.toURL() is deprecated.

Done.
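A tiny stand-alone illustration of why the change matters (the file path is
made up):

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// File.toURL() is deprecated because it does not escape characters that are
// illegal in URLs (e.g. spaces). Going through File.toURI() first produces a
// properly percent-escaped URL.
public class FileUrlDemo {
    public static void main(String[] args) throws MalformedURLException {
        File f = new File("/tmp/some file.jar");
        // Deprecated, broken for paths with spaces:
        //   URL bad = f.toURL();        // file:/tmp/some file.jar
        // Preferred:
        URL good = f.toURI().toURL();    // file:/tmp/some%20file.jar
        System.out.println(good);
    }
}
```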

bq. LocalJobRunner.java TaskRunner.setupWorkDir needs to be fixed to have a 
workDir arg to help the code in LocalJobRunner. Compilation(ant binary) breaks 
now.

Oops, done.

bq. TODO: Test pipes with LocalJobRunner and make sure that it works, i.e. 
verify HADOOP-5174.

I don't typically use Pipes, and it doesn't seem to compile on my Mac.
I'll try it on a Linux machine, but if it's easy/handy for you, it'd be great
if you could verify that bug.

bq. TestDistributedCacheManager * Code related to setFileStamps in 
JobClient.java (+636 to +656) and testManagerFlow() (+71 to +74) can be 
refactored into an internal-only method in DistributedCacheManager.

I extracted the method and moved it to TrackerDistributedCacheManager.  Now
that the class has "Tracker" in its name, the method is a little misplaced
there, but it's not too bad.

bq. Minor: assertNonNull(+90) check can be moved after +91 and use 
localCacheFiles for the null check

Done.

bq. TestMRWithDistributedCache Just documenting a TODO: Will need to fix this 
class's javadoc once MAPREDUCE-711 is in.

Indeed.

bq. Fix the comment at +84. It is actually two archives.

Done.

bq. Please put a comment for the class saying that it is NOT A FAST test, 
keeping in view the recent efforts to have a fast test target.

Done.

bq. Another point. We should consolidate TestMRWithDistributedCache, 
TestMiniMRLocalFS and TestMiniMRDFSCaching. That's a lot of code in three 
different files testing mostly common code. May be another JIRA issue if you 
feel so.

I agree.  I didn't find the relevant tests right away, and there are some bits
covered by all of them.  I'd prefer a separate JIRA.

That pretty much covers the points you brought up.  Take another look?

Cheers,

-- Philip


> extend DistributedCache to work locally (LocalJobRunner)
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-476
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-476
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: sam rash
>            Assignee: Philip Zeyliger
>            Priority: Minor
>         Attachments: HADOOP-2914-v1-full.patch, 
> HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, 
> MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, 
> MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, MAPREDUCE-476-v3.try2.patch, 
> MAPREDUCE-476.patch
>
>
> The DistributedCache does not work locally when using the outlined recipe at 
> http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html
>  
> Ideally, LocalJobRunner would take care of populating the JobConf and copying 
> remote files to the local file system (http; assume hdfs = default fs = local 
> fs when doing local development).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
