[ https://issues.apache.org/jira/browse/MAHOUT-598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044564#comment-13044564 ]
Elmer Garduno commented on MAHOUT-598:
--------------------------------------
The issue seems to be that
URI[] localFiles = DistributedCache.getCacheFiles(conf);
doesn't return a fully qualified path if fs.default.name is not set. Could
something like the patch proposed for
https://issues.apache.org/jira/browse/MAHOUT-700 fix this problem?
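For reference, here is a rough, untested sketch (against the Hadoop 0.20 API)
of the kind of change I mean, in the spirit of the MAHOUT-700 patch: build the
Path from the full cached URI so the scheme (e.g. s3n://) survives, instead of
resolving a bare path against the default filesystem:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

URI[] localFiles = DistributedCache.getCacheFiles(conf);
// Keep the scheme/authority from the cached URI (e.g. s3n://bucket/...).
// Using localFiles[0].getPath() would drop the scheme, and the bare path
// would then be resolved against fs.default.name (HDFS on EMR).
Path dictionaryFile = new Path(localFiles[0].toString());
// Ask the path itself which FileSystem it belongs to, rather than
// assuming the default filesystem via FileSystem.get(conf).
FileSystem fs = dictionaryFile.getFileSystem(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictionaryFile, conf);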
> Downstream steps in the seq2sparse job flow looking in wrong location for
> output from previous steps when running in Elastic MapReduce (EMR) cluster
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-598
> URL: https://issues.apache.org/jira/browse/MAHOUT-598
> Project: Mahout
> Issue Type: Bug
> Components: Integration
> Affects Versions: 0.4, 0.5
> Environment: seq2sparse, Mahout 0.4, S3, EMR, Hadoop 0.20.2
> Reporter: Timothy Potter
> Assignee: Robin Anil
> Fix For: 0.6
>
>
> While working on MAHOUT-588, I've discovered an issue with the seq2sparse job
> running on EMR. From what I can tell, this job is made up of multiple MR
> steps, and downstream steps expect the output from previous steps to be in
> HDFS, but the output is actually in S3 (see errors below). For example, the
> DictionaryVectorizer wrote "dictionary.file-0" to S3, but
> TFPartialVectorReducer is looking for it in HDFS.
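> To illustrate (this is not Mahout code, just a hypothetical snippet): an
> unqualified path is resolved against fs.default.name, which on EMR points
> at HDFS:
> Configuration conf = new Configuration();
> // On EMR the default filesystem is HDFS, so this bare path gets qualified
> // as hdfs://..., even though the job actually wrote the file to s3n://...
> Path p = new Path("/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0");
> FileSystem defaultFs = FileSystem.get(conf);
> System.out.println(defaultFs.makeQualified(p)); // prints hdfs://.../dictionary.file-0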
> To run this job, I spin up an EMR cluster and then add the following step to
> it (this is using the elastic-mapreduce-ruby tool):
> elastic-mapreduce --jar s3n://thelabdude/mahout-core-0.4-job.jar \
>   --main-class org.apache.mahout.driver.MahoutDriver \
>   --arg seq2sparse \
>   --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files-sm/ \
>   --arg -o --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/ \
>   --arg --weight --arg tfidf \
>   --arg --chunkSize --arg 200 \
>   --arg --minSupport --arg 2 \
>   --arg --minDF --arg 1 \
>   --arg --maxDFPercent --arg 90 \
>   --arg --norm --arg 2 \
>   --arg --maxNGramSize --arg 2 \
>   --arg --overwrite \
>   -j JOB_ID
> With these parameters, I see the following errors in the Hadoop logs:
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>   at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
> (the same FileNotFoundException appears two more times in the logs)
> Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/partial-vectors-0
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
>   at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:933)
>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:827)
>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>   at org.apache.mahout.vectorizer.common.PartialVectorMerger.mergePartialVectors(PartialVectorMerger.java:126)
>   at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:176)
>   at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:253)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> I don't think this is a "config" error on my side, because if I change the -o
> argument to:
> /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> then the job completes successfully, except the output is now stored in HDFS
> and not S3. After the job completes, if I SSH into the EMR master server, I
> see the following output as expected:
> hadoop@ip-10-170-93-177:~$ hadoop fs -lsr /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> drwxr-xr-x   - hadoop supergroup        0 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count
> -rw-r--r--   1 hadoop supergroup    26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00000
> -rw-r--r--   1 hadoop supergroup    26913 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00001
> -rw-r--r--   1 hadoop supergroup    26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00002
> -rw-r--r--   1 hadoop supergroup   104874 2011-01-24 23:42 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
> -rw-r--r--   1 hadoop supergroup    80493 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/frequency.file-0
> drwxr-xr-x   - hadoop supergroup        0 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/tf-vectors
> /part-r-00000
> ...
> The workaround is to write all output to HDFS, then SSH into the master
> server once the job completes and copy the output to S3.
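> (If it helps anyone, that copy step can be done in one command with distcp
> from the master node, assuming the cluster's S3 credentials are configured,
> something like:
> hadoop distcp /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm
> though I haven't verified this on EMR.)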