Not sure if this is something with my prod cluster or a bug, but when
running seq2sparse on my production Hadoop cluster, I make it all the
way through tokenization, dictionary creation, etc., and then the
TFPartialVectorReducer blows up:

11/07/30 06:00:04 INFO mapred.LocalJobRunner:
11/07/30 06:00:04 INFO mapred.TaskRunner: Task 'attempt_local_0003_m_000003_0' done.
11/07/30 06:00:04 INFO mapred.LocalJobRunner:
11/07/30 06:00:04 INFO mapred.Merger: Merging 4 sorted segments
11/07/30 06:00:04 INFO mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 243328920 bytes
11/07/30 06:00:04 INFO mapred.LocalJobRunner:
11/07/30 06:00:04 WARN mapred.LocalJobRunner: job_local_0003
java.lang.IllegalStateException: /user/jake/status_parsed/dictionary.file-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
    at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:130)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:215)
Caused by: java.io.FileNotFoundException: File file:/user/jake/status_parsed/dictionary.file-0 does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:372)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:718)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    ... 5 more
11/07/30 06:00:04 INFO mapred.JobClient: Job complete: job_local_0003


The file listed (note: without a filesystem URI!),
"/user/jake/status_parsed/dictionary.file-0", does exist on the cluster,
so my guess is it's somehow not making it into the DistributedCache properly.
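One detail that stands out in the log: everything is running under mapred.LocalJobRunner, and the FileNotFoundException shows a file: scheme, which is what Hadoop falls back to when an unqualified path is resolved against the default (local) filesystem. A hedged guess, not a confirmed diagnosis: if the cluster's core-site.xml isn't on the client classpath, the driver silently runs the job in local mode and resolves HDFS-relative paths against file:/. Something like this would need to be visible to the client (the namenode host/port below are placeholders, not your actual values):

```xml
<!-- core-site.xml (hypothetical values): if this file isn't on the client
     classpath, fs.default.name falls back to file:///, which would produce
     exactly the file:/user/... path in the stack trace above. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>
```

If that's the cause, the job would also show real attempt IDs (attempt_201107...) instead of the attempt_local_* ones in the log.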

Has anyone run into anything like this before?  It's been a while since I've
run seq2sparse on a real-hardware / managed cluster, so I'm not sure whether
it's me, Mahout, or a configuration setting somewhere.

  -jake
