[ https://issues.apache.org/jira/browse/MAHOUT-598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044564#comment-13044564 ]
Elmer Garduno commented on MAHOUT-598:
--------------------------------------
The issue seems to be that
URI[] localFiles = DistributedCache.getCacheFiles(conf);
doesn't return a fully qualified path if fs.default.name is not set. Could
something like the patch proposed for
https://issues.apache.org/jira/browse/MAHOUT-700 fix this problem?
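For reference, here is a rough, untested sketch (against the Hadoop 0.20 API)
of the kind of change I mean, in the spirit of the MAHOUT-700 patch: build the
Path from the full cached URI so the scheme (e.g. s3n://) survives, instead of
resolving a bare path against the default filesystem:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

URI[] localFiles = DistributedCache.getCacheFiles(conf);
// Keep the scheme/authority from the cached URI (e.g. s3n://bucket/...).
// Using localFiles[0].getPath() would drop the scheme, and the bare path
// would then be resolved against fs.default.name (HDFS on EMR).
Path dictionaryFile = new Path(localFiles[0].toString());
// Ask the path itself which FileSystem it belongs to, rather than
// assuming the default filesystem via FileSystem.get(conf).
FileSystem fs = dictionaryFile.getFileSystem(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictionaryFile, conf);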
> Downstream steps in the seq2sparse job flow looking in wrong location for
> output from previous steps when running in Elastic MapReduce (EMR) cluster
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-598
> URL: https://issues.apache.org/jira/browse/MAHOUT-598
> Project: Mahout
> Issue Type: Bug
> Components: Integration
> Affects Versions: 0.4, 0.5
> Environment: seq2sparse, Mahout 0.4, S3, EMR, Hadoop 0.20.2
> Reporter: Timothy Potter
> Assignee: Robin Anil
> Fix For: 0.6
>
>
> While working on MAHOUT-588, I've discovered an issue with the seq2sparse job
> running on EMR. From what I can tell, this job is made up of multiple MR
> steps, and downstream steps expect the output from previous steps to be in
> HDFS, but the output is actually in S3 (see errors below). For example, the
> DictionaryVectorizer wrote "dictionary.file-0" to S3, but
> TFPartialVectorReducer is looking for it in HDFS.
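> To illustrate (this is not Mahout code, just a hypothetical snippet): an
> unqualified path is resolved against fs.default.name, which on EMR points
> at HDFS:
> Configuration conf = new Configuration();
> // On EMR the default filesystem is HDFS, so this bare path gets qualified
> // as hdfs://..., even though the job actually wrote the file to s3n://...
> Path p = new Path("/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0");
> FileSystem defaultFs = FileSystem.get(conf);
> System.out.println(defaultFs.makeQualified(p)); // prints hdfs://.../dictionary.file-0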
> To run this job, I spin up an EMR cluster and then add the following step to
> it (this is using the elastic-mapreduce-ruby tool):
> elastic-mapreduce --jar s3n://thelabdude/mahout-core-0.4-job.jar \
>   --main-class org.apache.mahout.driver.MahoutDriver \
>   --arg seq2sparse \
>   --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files-sm/ \
>   --arg -o --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/ \
>   --arg --weight --arg tfidf \
>   --arg --chunkSize --arg 200 \
>   --arg --minSupport --arg 2 \
>   --arg --minDF --arg 1 \
>   --arg --maxDFPercent --arg 90 \
>   --arg --norm --arg 2 \
>   --arg --maxNGramSize --arg 2 \
>   --arg --overwrite \
>   -j JOB_ID
> With these parameters, I see the following errors in the Hadoop logs:
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>   at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
> (the same FileNotFoundException appears two more times in the logs)
> Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/partial-vectors-0
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
>   at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:933)
>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:827)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>   at org.apache.mahout.vectorizer.common.PartialVectorMerger.mergePartialVectors(PartialVectorMerger.java:126)
>   at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:176)
>   at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:253)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> I don't think this is a "config" error on my side, because if I change the -o
> argument to:
> /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> then the job completes successfully, except the output is now stored in HDFS
> and not S3. After the job completes, if I SSH into the EMR master server, I
> see the following output as expected:
> hadoop@ip-10-170-93-177:~$ hadoop fs -lsr /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> drwxr-xr-x   - hadoop supergroup        0 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count
> -rw-r--r--   1 hadoop supergroup    26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00000
> -rw-r--r--   1 hadoop supergroup    26913 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00001
> -rw-r--r--   1 hadoop supergroup    26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00002
> -rw-r--r--   1 hadoop supergroup   104874 2011-01-24 23:42 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
> -rw-r--r--   1 hadoop supergroup    80493 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/frequency.file-0
> drwxr-xr-x   - hadoop supergroup        0 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/tf-vectors
> /part-r-00000
> ...
> The workaround is to write all output to HDFS, then SSH into the master
> server once the job completes and copy the output to S3.
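> (If it helps anyone, that copy step can be done in one command with distcp
> from the master node, assuming the cluster's S3 credentials are configured,
> something like:
> hadoop distcp /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm
> though I haven't verified this on EMR.)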