[ http://issues.apache.org/jira/browse/HADOOP-576?page=comments#action_12445020 ]

Mahadev konar commented on HADOOP-576:
--------------------------------------
Sorry, the second line should have been:

bin/hadoop jar hadoop-streaming.jar -cacheFile dfs://host:port/path_in_dfs_of_file#NAME
bin/hadoop jar hadoop-streaming.jar -cacheArchive dfs://host:port/path_in_dfs_of_archive#NAME

> Enhance streaming to use the new caching feature
> ------------------------------------------------
>
>                 Key: HADOOP-576
>                 URL: http://issues.apache.org/jira/browse/HADOOP-576
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Michel Tourn
>         Assigned To: Mahadev konar
>         Attachments: streaming.patch
>
>
> Design proposal to expose filecache access to Hadoop streaming.
> The main differences with the pure-Java filecache code are:
> 1. As part of job launch (in the hadoopStreaming client) we validate the
> presence of the cached archives/files in DFS.
> 2. As part of Task initialization, a symbolic link to the cached
> files/unarchived directories is created in the Task working directory.
>
> C1. New command-line options (example)
> -cachearchive dfs:/user/me/big.zip#big_1
> -cachefile dfs:/user/other/big.zip#big_2
> -cachearchive dfs:/user/me/bang.zip
> These map to API calls to the static methods:
> DistributedCache.addCacheArchive(URI uri, Configuration conf)
> DistributedCache.addCacheFile(URI uri, Configuration conf)
> This is done in class StreamJob, methods parseArgv() and setJobConf().
> The code should be similar to the way "-file" is handled.
> One difference is that we now require a FileSystem instance to VALIDATE the
> DFS paths in -cachefile and -cachearchive. The FileSystem instance should not
> be accessed before the filesystem is set by this: setUserJobConfProps(true);
> If the FileSystem instance is "local" and there are -cachearchive/-cachefile
> options, then fail: this is not supported.
> Otherwise this should return true:
> fs_.isFile(Path) for each -cachearchive/-cachefile option.
> Only in verbose mode: show the isFile status of each option.
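[Editor's note: as a rough, hypothetical sketch (this is not code from the attached patch, and `CacheArg`/`symlinkName` are illustration names), the mapping from a -cachearchive/-cachefile argument to its working-directory name can be expressed with java.net.URI, since the proposal requires the argument to stay parsable by that class: the #fragment wins when present, and the leaf of the DFS path is the default.]

```java
import java.net.URI;

public class CacheArg {
    // Hypothetical helper (not in the patch): derive the symlink name for a
    // -cachearchive/-cachefile argument -- the #fragment if given, otherwise
    // the leaf name of the DFS path.
    static String symlinkName(String arg) {
        URI uri = URI.create(arg);                         // must stay parsable by java.net.URI
        if (uri.getFragment() != null) {
            return uri.getFragment();                      // e.g. "big_1"
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);  // default: leaf name, e.g. "bang.zip"
    }

    public static void main(String[] args) {
        System.out.println(symlinkName("dfs:/user/me/big.zip#big_1")); // big_1
        System.out.println(symlinkName("dfs:/user/me/bang.zip"));      // bang.zip
    }
}
```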
> In any verbosity mode: show the first failed isFile() status and abort using
> method StreamJob.fail().
>
> C2. Task initialization
> The symlinks are called:
> Workingdir/big_1 (points to directory /cache/user/me/big_zip)
> Workingdir/big_2 (points to file /cache/user/other/big.zip)
> Workingdir/bang.zip (points to directory /cache/user/me/bang_zip)
> This will require hadoopStreaming to create symbolic links.
> Hadoop should have code to do this in a portable way,
> although this may not be supported on non-Unix platforms.
> Cross-platform support is harder than for hard links.
> (Cygwin soft links are not a solution: they only work for applications
> compiled with cygwin1.dll.)
> Symbolic links make JUnit tests less portable,
> so maybe the test should run as part of the ant target test-unix (in
> contrib/streaming/build.xml).
>
> The parameters after -cachearchive and -cachefile have the following
> properties:
> A. You can optionally give a name to your symlink (after #).
> B. The default name is the leaf name (big.zip, big.zip, bang.zip).
> C. If the same leaf name appears more than once, you MUST give a name;
> otherwise the streaming client aborts and complains. For example, the
> streaming client should complain about this:
> -cachearchive dfs:/user/me/big.zip
> -cachefile dfs:/user/other/big.zip
> It complains because the multiple occurrences of "big.zip" are not
> disambiguated with #big_1, #big_2.
> Ideally the streaming client error message should then generate an example of
> how to fix the parameters:
> -cachearchive dfs:/user/me/big.zip#1
> -cachefile dfs:/user/other/big.zip#2
>
> ---------
> hadoopStreaming client note:
> Currently argv parsing is position-independent, i.e. changing the order of
> arguments never impacts the behaviour of hadoopStreaming. It would be good to
> keep this behaviour.
>
> URI notes:
> The scheme is "dfs:" for consistency with the current state of the Hadoop
> code.
> However, there is a proposal to change the scheme to "hdfs:".
> Using a URI fragment to give a local name to the resource is unusual. The
> main constraint is that the URI should remain parsable by
> java.net.URI(String). And encoding attributes in the fragment is standard
> (like CGI parameters in an HTTP GET request).
> (The fragment is #big_2 in dfs:/user/other/big.zip#big_2.)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
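[Editor's note: the duplicate-name rule (C) quoted above, including the suggested "fix it like this" error message, could be implemented along these lines. This is a hypothetical sketch, not the patch's API; `CacheNameCheck` and `checkNames` are made-up illustration names.]

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CacheNameCheck {
    // Hypothetical sketch of rule C: every -cachearchive/-cachefile argument
    // must resolve to a distinct name (its #fragment, else the path's leaf).
    // Returns an error message for the first clash, or null if all distinct.
    static String checkNames(List<String> args) {
        Map<String, String> seen = new HashMap<>(); // name -> first argument using it
        for (String arg : args) {
            URI uri = URI.create(arg);
            String name = uri.getFragment();
            if (name == null) {
                String path = uri.getPath();
                name = path.substring(path.lastIndexOf('/') + 1); // default leaf name
            }
            String first = seen.put(name, arg);
            if (first != null) {
                // Mirror the proposal: show an example of disambiguating fragments.
                return "Duplicate cache name \"" + name + "\"; disambiguate, e.g. "
                        + first + "#1 and " + arg + "#2";
            }
        }
        return null;
    }
}
```

Used on the failing example from the description, `checkNames` would reject `dfs:/user/me/big.zip` plus `dfs:/user/other/big.zip`, and accept the same pair once `#big_1`/`#big_2` fragments are added.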