[
https://issues.apache.org/jira/browse/MAPREDUCE-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145669#comment-13145669
]
Robert Joseph Evans commented on MAPREDUCE-3323:
------------------------------------------------
I have read through all of your patches and I have a few comments.
# I don't really like the name current.task.type.internal. It would be
better to prefix it with mapreduce.
# I think it is slightly faster to change {code}fileURI.toArray(new
URI[0]){code} to {code}fileURI.toArray(new URI[fileURI.size()]){code}, but this
is just a nit.
# There are no tests in the patches. I know you have done some manual testing,
but adding/updating the unit tests is important for this to be accepted.
# Have you tested add(Archive|File)ToClassPathFor(Map|Reduce)? They set
"mapred.job.classpath.(archives|files)", so if you use these methods some of the
entries in "mapred.job.classpath.(archives|files)" will not be valid.
# Why are you setting CACHE_(FILE|ARCHIVE)_FOR_(MAP|REDUCE)? It seems like you
could just rely on the existence of CACHE_(ARCHIVES|FILES)_(MAP|REDUCE).
# Could you please add the new user-facing configuration keys to
mapred-default.xml so that they are documented?
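The toArray nit in point 2 can be illustrated with a minimal, self-contained sketch (the URI values are made up for illustration; this is not code from the patch):

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

public class ToArrayDemo {
    public static void main(String[] args) throws URISyntaxException {
        List<URI> fileURI = new ArrayList<>();
        fileURI.add(new URI("hdfs://nn/cache/file1"));
        fileURI.add(new URI("hdfs://nn/cache/file2"));

        // Zero-length variant: toArray must reflectively allocate a
        // correctly sized URI[] internally before copying.
        URI[] a = fileURI.toArray(new URI[0]);

        // Pre-sized variant: the caller supplies an array of the right
        // size, so toArray just copies into it.
        URI[] b = fileURI.toArray(new URI[fileURI.size()]);

        System.out.println(a.length == b.length && a.length == 2);
    }
}
```

Both variants return the same contents; whether the pre-sized array is actually faster depends on the JVM, which is why this is only a nit.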
> Add a new interface for the Distributed Cache that is specific to Map or
> Reduce, but not both.
> ---------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-3323
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3323
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: distributed-cache, tasktracker
> Affects Versions: 0.20.203.0
> Reporter: Azuryy(Chijiong)
> Fix For: 0.20.203.0
>
> Attachments: DistributedCache.patch, GenericOptionsParser.patch,
> JobClient.patch, TaskDistributedCacheManager.patch, TaskTracker.patch
>
>
> We put some files into the Distributed Cache, but sometimes only the Map or
> the Reduce phase uses these cached files, not both. The TaskTracker always
> downloads all cached files from HDFS, so if the cache contains some fairly
> large files, this wastes time.
> So this patch adds the following new APIs to DistributedCache.java:
> addArchiveToClassPathForMap
> addArchiveToClassPathForReduce
> addFileToClassPathForMap
> addFileToClassPathForReduce
> addCacheFileForMap
> addCacheFileForReduce
> addCacheArchiveForMap
> addCacheArchiveForReduce
> The new API does not affect the original interface. Users can access these
> features in either of the following two ways:
> 1)
> hadoop job **** -files file1 -mapfiles file2 -reducefiles file3 -archives
> arc1 -maparchives arc2 -reducearchives arc3
> 2)
> DistributedCache.addCacheFile(conf, file1);
> DistributedCache.addCacheFileForMap(conf, file2);
> DistributedCache.addCacheFileForReduce(conf, file3);
> DistributedCache.addCacheArchive(conf, arc1);
> DistributedCache.addCacheArchiveForMap(conf, arc2);
> DistributedCache.addCacheArchiveForReduce(conf, arc3);
> These two methods produce the same result. That is:
> You put six items into the distributed cache, file1 ~ file3 and arc1 ~ arc3,
> where file1 and arc1 are cached for both map and reduce;
> file2 and arc2 are cached only for map;
> file3 and arc3 are cached only for reduce.
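A map-only setter like addCacheFileForMap could work roughly as sketched below. This is a hypothetical illustration, not code from the attached patches: the key name "mapreduce.cache.files.map" is an assumed placeholder, and java.util.Properties stands in for Hadoop's Configuration so the sketch is self-contained.

```java
import java.util.Properties;

public class MapOnlyCacheSketch {
    // Assumed key name for the map-only cache-file list; the real patch
    // may use a different configuration key.
    static final String CACHE_FILES_MAP = "mapreduce.cache.files.map";

    // Appends a URI to the comma-separated map-only cache-file list,
    // mirroring how DistributedCache setters accumulate entries.
    static void addCacheFileForMap(Properties conf, String uri) {
        String existing = conf.getProperty(CACHE_FILES_MAP);
        conf.setProperty(CACHE_FILES_MAP,
                existing == null ? uri : existing + "," + uri);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        addCacheFileForMap(conf, "hdfs://nn/cache/file2");
        addCacheFileForMap(conf, "hdfs://nn/cache/file3");
        System.out.println(conf.getProperty(CACHE_FILES_MAP));
    }
}
```

The TaskTracker would then consult the map-only list when localizing files for a map task and skip it for reduce tasks, which is the download saving the issue describes.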
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira