[ http://issues.apache.org/jira/browse/HADOOP-576?page=comments#action_12445020 ]

Mahadev konar commented on HADOOP-576:
--------------------------------------
Sorry, the second line should have been:

bin/hadoop jar hadoop-streaming.jar -cacheFile dfs://host:port/path_in_dfs_of_file#NAME
bin/hadoop jar hadoop-streaming.jar -cacheArchive dfs://host:port/path_in_dfs_of_archive#NAME

> Enhance streaming to use the new caching feature
> ------------------------------------------------
>
>                 Key: HADOOP-576
>                 URL: http://issues.apache.org/jira/browse/HADOOP-576
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Michel Tourn
>         Assigned To: Mahadev konar
>         Attachments: streaming.patch
>
>
> Design proposal to expose filecache access to Hadoop streaming.
> The main differences with the pure-Java filecache code are:
> 1. As part of job launch (in the hadoopStreaming client) we validate the
> presence of the cached archives/files in DFS.
> 2. As part of Task initialization, a symbolic link to the cached
> files/unarchived directories is created in the Task working directory.
>
> C1. New command-line options (example)
> -cachearchive dfs:/user/me/big.zip#big_1
> -cachefile dfs:/user/other/big.zip#big_2
> -cachearchive dfs:/user/me/bang.zip
> These map to API calls to the static methods:
> DistributedCache.addCacheArchive(URI uri, Configuration conf)
> DistributedCache.addCacheFile(URI uri, Configuration conf)
> This is done in class StreamJob, methods parseArgv() and setJobConf().
> The code should be similar to the way "-file" is handled.
> One difference is that we now require a FileSystem instance to VALIDATE the
> DFS paths in -cachefile and -cachearchive. The FileSystem instance should not
> be accessed before the filesystem is set by this: setUserJobConfProps(true);
> If the FileSystem instance is "local" and there are -cachearchive/-cachefile
> options, then fail: this is not supported.
> Otherwise this should return true:
> fs_.isFile(Path) for each -cachearchive/-cachefile option.
> Only in verbose mode: show the isFile status of each option.
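[Editor's note: as a rough, hypothetical sketch (this is not code from the attached patch, and `CacheArg`/`symlinkName` are illustration names), the mapping from a -cachearchive/-cachefile argument to its working-directory name can be expressed with java.net.URI, since the proposal requires the argument to stay parsable by that class: the #fragment wins when present, and the leaf of the DFS path is the default.]

```java
import java.net.URI;

public class CacheArg {
    // Hypothetical helper (not in the patch): derive the symlink name for a
    // -cachearchive/-cachefile argument -- the #fragment if given, otherwise
    // the leaf name of the DFS path.
    static String symlinkName(String arg) {
        URI uri = URI.create(arg);                         // must stay parsable by java.net.URI
        if (uri.getFragment() != null) {
            return uri.getFragment();                      // e.g. "big_1"
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);  // default: leaf name, e.g. "bang.zip"
    }

    public static void main(String[] args) {
        System.out.println(symlinkName("dfs:/user/me/big.zip#big_1")); // big_1
        System.out.println(symlinkName("dfs:/user/me/bang.zip"));      // bang.zip
    }
}
```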
> In any verbosity mode: show the first failed isFile() status and abort using
> method StreamJob.fail().
>
> C2. Task initialization
> The symlinks are called:
> Workingdir/big_1 (points to directory /cache/user/me/big_zip)
> Workingdir/big_2 (points to file /cache/user/other/big.zip)
> Workingdir/bang.zip (points to directory /cache/user/me/bang_zip)
> This will require hadoopStreaming to create symbolic links.
> Hadoop should have code to do this in a portable way,
> although this may not be supported on non-Unix platforms.
> Cross-platform support is harder than for hard links.
> (Cygwin soft links are not a solution: they only work for applications
> compiled with cygwin1.dll.)
> Symbolic links make JUnit tests less portable,
> so maybe the test should run as part of the ant target test-unix (in
> contrib/streaming/build.xml).
>
> The parameters after -cachearchive and -cachefile have the following
> properties:
> A. You can optionally give a name to your symlink (after #).
> B. The default name is the leaf name (big.zip, big.zip, bang.zip).
> C. If the same leaf name appears more than once, you MUST give a name;
> otherwise the streaming client aborts and complains. For example, the
> streaming client should complain about this:
> -cachearchive dfs:/user/me/big.zip
> -cachefile dfs:/user/other/big.zip
> It complains because the multiple occurrences of "big.zip" are not
> disambiguated with #big_1, #big_2.
> Ideally the streaming client error message should then generate an example of
> how to fix the parameters:
> -cachearchive dfs:/user/me/big.zip#1
> -cachefile dfs:/user/other/big.zip#2
>
> ---------
> hadoopStreaming client note:
> Currently argv parsing is position-independent, i.e. changing the order of
> arguments never impacts the behaviour of hadoopStreaming. It would be good to
> keep this behaviour.
>
> URI notes:
> The scheme is "dfs:" for consistency with the current state of the Hadoop
> code.
> However, there is a proposal to change the scheme to "hdfs:".
> Using a URI fragment to give a local name to the resource is unusual. The
> main constraint is that the URI should remain parsable by
> java.net.URI(String). And encoding attributes in the fragment is standard
> (like CGI parameters in an HTTP GET request).
> (The fragment is #big_2 in dfs:/user/other/big.zip#big_2.)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
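[Editor's note: the duplicate-name rule (C) quoted above, including the suggested "fix it like this" error message, could be implemented along these lines. This is a hypothetical sketch, not the patch's API; `CacheNameCheck` and `checkNames` are made-up illustration names.]

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CacheNameCheck {
    // Hypothetical sketch of rule C: every -cachearchive/-cachefile argument
    // must resolve to a distinct name (its #fragment, else the path's leaf).
    // Returns an error message for the first clash, or null if all distinct.
    static String checkNames(List<String> args) {
        Map<String, String> seen = new HashMap<>(); // name -> first argument using it
        for (String arg : args) {
            URI uri = URI.create(arg);
            String name = uri.getFragment();
            if (name == null) {
                String path = uri.getPath();
                name = path.substring(path.lastIndexOf('/') + 1); // default leaf name
            }
            String first = seen.put(name, arg);
            if (first != null) {
                // Mirror the proposal: show an example of disambiguating fragments.
                return "Duplicate cache name \"" + name + "\"; disambiguate, e.g. "
                        + first + "#1 and " + arg + "#2";
            }
        }
        return null;
    }
}
```

Used on the failing example from the description, `checkNames` would reject `dfs:/user/me/big.zip` plus `dfs:/user/other/big.zip`, and accept the same pair once `#big_1`/`#big_2` fragments are added.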