I'm not sure if this is something with my prod cluster or a bug, but when running seq2sparse on my production Hadoop cluster, I make it all the way through tokenization, dictionary creation, etc., but then the TFPartialVectorReducer blows up:
11/07/30 06:00:04 INFO mapred.LocalJobRunner:
11/07/30 06:00:04 INFO mapred.TaskRunner: Task 'attempt_local_0003_m_000003_0' done.
11/07/30 06:00:04 INFO mapred.LocalJobRunner:
11/07/30 06:00:04 INFO mapred.Merger: Merging 4 sorted segments
11/07/30 06:00:04 INFO mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 243328920 bytes
11/07/30 06:00:04 INFO mapred.LocalJobRunner:
11/07/30 06:00:04 WARN mapred.LocalJobRunner: job_local_0003
java.lang.IllegalStateException: /user/jake/status_parsed/dictionary.file-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
    at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:130)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:215)
Caused by: java.io.FileNotFoundException: File file:/user/jake/status_parsed/dictionary.file-0 does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:372)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:718)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    ... 5 more
11/07/30 06:00:04 INFO mapred.JobClient: Job complete: job_local_0003

The file listed (without a filesystem URI!)
"/user/jake/status_parsed/dictionary.file-0" does exist on the cluster, but it's probably not making it into the DistributedCache properly somehow. Has anyone run into anything like this before? It's been a while since I've run seq2sparse on a real-hardware / managed cluster, so I'm not sure whether it's me, Mahout, or a configuration setting. -jake
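P.S. For what it's worth, the "without a filesystem URI" part may be the whole story: a scheme-less path resolves against whatever the job's default filesystem is, and under LocalJobRunner that default is the local filesystem, hence RawLocalFileSystem looking for file:/user/jake/... . A quick plain-Java illustration of that resolution behavior (no Hadoop dependency; the "namenode:8020" authority below is made up for the example):

```java
import java.net.URI;

public class SchemeCheck {
    public static void main(String[] args) {
        // The dictionary path exactly as it appears in the stack trace: no scheme.
        URI bare = URI.create("/user/jake/status_parsed/dictionary.file-0");
        System.out.println("scheme = " + bare.getScheme());  // prints "scheme = null"
        // With a null scheme, Hadoop falls back to the configured default
        // filesystem, which for a local runner is file://.

        // A fully qualified path pins the lookup to HDFS regardless of the
        // default filesystem ("namenode:8020" is a hypothetical host/port).
        URI qualified = URI.create("hdfs://namenode:8020/user/jake/status_parsed/dictionary.file-0");
        System.out.println("scheme = " + qualified.getScheme());  // prints "scheme = hdfs"
    }
}
```

So I'm guessing that if Mahout qualified the dictionary path against the source filesystem before stashing it in the job configuration, the reducer would hit HDFS instead of the local disk, but I haven't dug into where seq2sparse actually sets it.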
