[ https://issues.apache.org/jira/browse/MAHOUT-598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990131#comment-12990131 ]
Sean Owen commented on MAHOUT-598:
----------------------------------
I don't know this code, but by tracing back I can see that the issue likely
starts in DictionaryVectorizer.createTermFrequencyVectors(). Those
dictionaryChunks perhaps no longer carry the s3n:// qualifier.
If you're running on Amazon EMR, I'd suggest you set, one way or another, the
Hadoop parameter fs.default.name to "s3://[bucket]". This tells Hadoop to
assume that unqualified paths are relative to your S3 bucket. It may be just a
workaround that masks some unqualified use of URIs, which isn't good, but it
might help.
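For illustration only, here is a minimal sketch of what that setting changes. It is
not Mahout or EMR code; the bucket name is just the one from this issue, and on EMR
you would normally set the property in the site configuration rather than in code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DefaultFsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Whatever the site config says; typically hdfs://<namenode> on a cluster.
    System.out.println(FileSystem.getDefaultUri(conf));
    // Point the default filesystem at the S3 bucket (example bucket name).
    conf.set("fs.default.name", "s3://thelabdude");
    System.out.println(FileSystem.getDefaultUri(conf)); // s3://thelabdude
    // From here on, any Path without a scheme is resolved against the S3 bucket
    // instead of HDFS.
  }
}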
> Downstream steps in the seq2sparse job flow looking in wrong location for output from previous steps when running in Elastic MapReduce (EMR) cluster
> --------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-598
> URL: https://issues.apache.org/jira/browse/MAHOUT-598
> Project: Mahout
> Issue Type: Bug
> Components: Utils
> Affects Versions: 0.4
> Environment: seq2sparse, Mahout 0.4, S3, EMR, Hadoop 0.20.2
> Reporter: Timothy Potter
> Priority: Minor
>
> While working on MAHOUT-588, I've discovered an issue with the seq2sparse job
> running on EMR. From what I can tell, this job is made up of multiple MR steps,
> and downstream steps expect the output from previous steps to be in HDFS, but
> the output is in S3 (see errors below). For example, the DictionaryVectorizer
> wrote "dictionary.file-0" to S3, but TFPartialVectorReducer is looking for it
> in HDFS.
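> For illustration, a minimal sketch (not the actual Mahout code) of why a path
> that loses its s3n:// scheme ends up being looked up on the default (HDFS)
> filesystem; the paths are the ones from this job:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
>
> public class PathResolution {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     Path qualified = new Path("s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0");
>     Path unqualified = new Path("/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0");
>     // "s3n" -> Path.getFileSystem(conf) would return the native S3 filesystem.
>     System.out.println(qualified.toUri().getScheme());
>     // null -> Path.getFileSystem(conf) falls back to fs.default.name, which is
>     // HDFS on the EMR cluster, matching the stack traces below.
>     System.out.println(unqualified.toUri().getScheme());
>   }
> }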
> To run this job, I spin up an EMR cluster and then add the following step to
> it (this is using the elastic-mapreduce-ruby tool):
> elastic-mapreduce --jar s3n://thelabdude/mahout-core-0.4-job.jar \
> --main-class org.apache.mahout.driver.MahoutDriver \
> --arg seq2sparse \
> --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files-sm/ \
> --arg -o --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/ \
> --arg --weight --arg tfidf \
> --arg --chunkSize --arg 200 \
> --arg --minSupport --arg 2 \
> --arg --minDF --arg 1 \
> --arg --maxDFPercent --arg 90 \
> --arg --norm --arg 2 \
> --arg --maxNGramSize --arg 2 \
> --arg --overwrite \
> -j JOB_ID
> With these parameters, I see the following errors in the hadoop logs:
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/partial-vectors-0
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:933)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:827)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>     at org.apache.mahout.vectorizer.common.PartialVectorMerger.mergePartialVectors(PartialVectorMerger.java:126)
>     at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:176)
>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:253)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> I don't think this is a "config" error on my side, because if I change the -o
> argument to:
> /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> then the job completes successfully, except that the output is now stored in
> HDFS and not S3. After the job completes, if I SSH into the EMR master server,
> I see the following output as expected:
> hadoop@ip-10-170-93-177:~$ hadoop fs -lsr /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> drwxr-xr-x   - hadoop supergroup      0 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count
> -rw-r--r--   1 hadoop supergroup  26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00000
> -rw-r--r--   1 hadoop supergroup  26913 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00001
> -rw-r--r--   1 hadoop supergroup  26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00002
> -rw-r--r--   1 hadoop supergroup 104874 2011-01-24 23:42 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
> -rw-r--r--   1 hadoop supergroup  80493 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/frequency.file-0
> drwxr-xr-x   - hadoop supergroup      0 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/tf-vectors
> /part-r-00000
> ...
> The work-around is to write all output to HDFS, SSH into the master server
> once the job completes, and then copy the output to S3.
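> A rough sketch of that copy step, in case it is done programmatically rather
> than with "hadoop distcp"; the paths are the ones from this job, and the S3
> credentials are assumed to already be in the configuration, as they are on an
> EMR node:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FileUtil;
> import org.apache.hadoop.fs.Path;
>
> public class CopyOutputToS3 {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Path src = new Path("/thelabdude/asf-mail-archives/mahout-0.4/vectors-sm");
>     Path dst = new Path("s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm");
>     FileSystem srcFs = FileSystem.get(conf);     // default FS, i.e. HDFS on the master
>     FileSystem dstFs = dst.getFileSystem(conf);  // needs fs.s3n.* credentials in conf
>     // Recursively copy the job output from HDFS to S3; 'false' keeps the HDFS copy.
>     FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
>   }
> }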