Hi,

I'm a Hadoop/HDFS newbie and have been struggling to get the streaming example 
working with -archives; cf. 
http://hadoop.apache.org/common/docs/r0.20.1/streaming.html#Large+files+and+archives+in+Hadoop+Streaming

My environment is the pseudo-distributed setup described at: 
http://hadoop.apache.org/common/docs/current/quickstart.html#PseudoDistributed

I've run into a couple of issues. The first is a FileNotFoundException when 
the #symlink suffix is specified with the -archives or -files options, as per 
the tutorial.
        
hadoop jar $HADOOP_HOME/hadoop-0.20.1-streaming.jar -archives 
"hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink" -input 
"samples/cachefile/input.txt" -mapper "xargs cat" -reducer "cat" -output 
"samples/cachefile/out"
java.io.FileNotFoundException: File hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink does not exist.
        at org.apache.hadoop.util.GenericOptionsParser.validateFiles(GenericOptionsParser.java:349)
        at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:275)
        at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:375)
        at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
        at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:138)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:32)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

If I remove the "#testlink" from the archives definition, the error goes away, 
but the symlink that the tutorial describes is not created.
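
To spell out what I understand the "#" suffix to mean: the part before "#" is 
the file that has to exist in HDFS, and the part after it is only the name of 
the link in the task's working directory. The exception text above names the 
whole string, "#testlink" included, as the missing file. A minimal sketch of 
that split in plain shell (the variable names are mine, nothing here calls 
Hadoop):

```shell
# Split a cache URI into the HDFS path and the symlink name.
uri="hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink"

archive_path="${uri%%#*}"   # everything before the first '#'
link_name="${uri#*#}"       # everything after the first '#'

echo "archive: $archive_path"
echo "symlink: $link_name"
```

The archive itself does exist at the path in the first line of output, which 
is why I suspect the check is being run against the unsplit string.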

I've seen JIRA issue http://issues.apache.org/jira/browse/HADOOP-6178, which 
shows no fix version, but it links to other issues that are supposedly fixed 
in 0.20.1, which is the version I have.

The second issue is an "Unrecognized option: -archives" error when -archives 
is specified at the end of the argument list.

hadoop jar $HADOOP_HOME/hadoop/hadoop-0.20.1-streaming.jar -input 
"samples/cachefile/input.txt" -mapper "xargs cat" -reducer "cat" -output 
"samples/cachefile/out9" -archives 
"hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink"
10/02/19 14:29:11 ERROR streaming.StreamJob: Unrecognized option: -archives
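
My reading of the streaming docs is that -archives is one of the generic 
options, and that those are expected before the streaming-specific options 
(-input, -mapper and so on). Assuming that is right, the arguments would need 
to be ordered as in the dry-run sketch below, which only prints the command 
and does not submit anything. This is the same ordering as my first command, 
so at best it avoids the "Unrecognized option" error, not the 
FileNotFoundException.

```shell
# Dry run only: prints the command line instead of submitting a job.
# Generic options (-archives here) go first, streaming options after.
set -- \
  -archives "hdfs://localhost:9000/user/me/samples/cachefile/cachedir.jar#testlink" \
  -input "samples/cachefile/input.txt" \
  -mapper "xargs cat" \
  -reducer "cat" \
  -output "samples/cachefile/out9"

first_opt="$1"   # the first argument should be the generic option

# Note: the printed line does not preserve the quoting of "xargs cat".
echo "hadoop jar \$HADOOP_HOME/hadoop/hadoop-0.20.1-streaming.jar $*"
```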

Any help getting past this would be appreciated. Am I missing a configuration 
setting that allows symlinking? I'm really hoping to use the archives feature.

-Michael
 
