Hi Michael,

There is a bug with passing a symlink name for the -files and -archives options. See MAPREDUCE-787. If you don't pass a symlink name for the URI in -files or -archives, a symlink with the file's actual name is created. So, if you pass -archives "hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar", a symlink named cachedir.jar will be created.
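As a rough sketch, your first command could be adjusted as follows, assuming the same jar and paths as in your message. The run_streaming wrapper and the DRY_RUN variable are only illustrative and are not part of Hadoop. Here -archives comes before the streaming options and carries no #symlink suffix, so the link should end up being named cachedir.jar.

```shell
#!/bin/sh
# Illustrative wrapper (hypothetical name, not part of Hadoop).
# By default it only prints the command; run with DRY_RUN= (empty) to execute it.
: "${DRY_RUN=1}"

run_streaming() {
  # Archive URI taken from the original message, without the "#testlink" suffix.
  archive="hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar"

  # Generic option (-archives) first, then the streaming command options.
  # Note: the printed form drops the quotes around "xargs cat".
  ${DRY_RUN:+echo} hadoop jar "$HADOOP_HOME/hadoop-0.20.1-streaming.jar" \
    -archives "$archive" \
    -input "samples/cachefile/input.txt" \
    -mapper "xargs cat" \
    -reducer "cat" \
    -output "samples/cachefile/out"
}

run_streaming
```

Paths listed in input.txt would presumably need to use the cachedir.jar link name rather than testlink.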
-files and -archives are generic options. For all commands, generic options must come first, followed by the command options. The documentation you cite is corrected in MAPREDUCE-813.

Thanks
Amareshwari

On 2/20/10 9:57 AM, "Michael Kintzer" <[email protected]> wrote:

> Hi,
>
> Hadoop/HDFS newbie.  Been struggling with getting the streaming example
> working with -archives.  c.f.
> http://hadoop.apache.org/common/docs/r0.20.1/streaming.html#Large+files+and+archives+in+Hadoop+Streaming
>
> My environment is the Pseudo-distributed environment setup per:
> http://hadoop.apache.org/common/docs/current/quickstart.html#PseudoDistributed
>
> I've run into a couple issues.  First issue is "FileNotFoundException" when
> the #symlink suffix is specified with the -archives or -files options as per
> the tutorial.
>
> hadoop jar $HADOOP_HOME/hadoop-0.20.1-streaming.jar -archives
> "hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink"
> -input "samples/cachefile/input.txt" -mapper "xargs cat" -reducer "cat"
> -output "samples/cachefile/out"
> java.io.FileNotFoundException: File
> hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink does
> not exist.
>     at org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:349)
>     at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:275)
>     at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
>     at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
>     at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>     at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:32)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> If I remove the "#testlink" from the archives definition, the error goes away
> but the symlink is not created, as per the tutorial documentation.
>
> I've seen this JIRA issue http://issues.apache.org/jira/browse/HADOOP-6178,
> shows no FIX version, but the Issue Links to others which are supposedly
> fixed in 0.20.1 which I have.
>
> 2nd issue is "Unrecognized option -archives" when -archives is specified at
> the end of the arg list.
>
> hadoop jar $HADOOP_HOME/hadoop/hadoop-0.20.1-streaming.jar -input
> "samples/cachefile/input.txt" -mapper "xargs cat" -reducer "cat" -output
> "samples/cachefile/out9" -archives
> "hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink"
> 10/02/19 14:29:11 ERROR streaming.StreamJob: Unrecognized option: -archives
>
> Any help getting past this appreciated.   Am I missing a configuration
> setting that allows symlinking?  Really hoping to use the archives feature.
>
> -Michael
