Hello Mahout developers / users,

I am trying to convert a properly formatted SequenceFile to Mahout vectors
to run LDA on them. As reference I am using these two documents:
http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

I checked out the Mahout code from SVN on February 11th, 2010. Below I list the
steps I took and the problems I encountered:

export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/

$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
  org.apache.mahout.text.SparseVectorsFromSequenceFiles \
  -i /user/MY_USERNAME/projects/lda/twitter_sequence_files/ \
  -o /user/MY_USERNAME/projects/lda/mahout_vectors/ \
  -wt tf -chunk 300 \
  -a org.apache.lucene.analysis.standard.StandardAnalyzer \
  --minSupport 2 --minDF 1 --maxDFPercent 50 --norm 2

*Problem #1:* I got this error at the end, although I think everything
finished more or less correctly:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
    at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
    at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
  org.apache.mahout.clustering.lda.LDADriver \
  -i /user/MY_USERNAME/projects/lda/mahout_vectors/ \
  -o /user/MY_USERNAME/projects/lda/lda_out/ \
  -k 20 --numWords 100000 --numReducers 33

*Problem #2:*

Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data

*Tried to fix:*

../../hadoop fs -mv \
  /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000 \
  /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data

*Ran again:*

$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
  org.apache.mahout.clustering.lda.LDADriver \
  -i /user/MY_USERNAME/projects/lda/mahout_vectors/ \
  -o /user/MY_USERNAME/projects/lda/lda_out/ \
  -k 20 --numWords 100000 --numReducers 33

*Problem #3:*

Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data

[had...@some_server retweets]$ ../../hadoop fs -ls /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
Found 3 items
-rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
-rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
-rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
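
Problems #2 and #3 both come down to LDADriver looking for a single file
named "data" in a directory that holds part-NNNNN files instead. A minimal
sketch for checking this before launching the job follows. It is not part of
Mahout: the function name check_data_file and the HADOOP variable are
hypothetical, and it assumes that "hadoop fs -test -e" exists in your Hadoop
version and sets its exit status to 0 when the path exists.

```shell
#!/bin/sh
# Hypothetical helper (not a Mahout or Hadoop tool): report whether an HDFS
# directory contains the "data" file named in the FileNotFoundException.
# Assumption: HADOOP may point at the hadoop launcher; defaults to "hadoop".
check_data_file() {
    # $1 = HDFS directory to inspect
    if ${HADOOP:-hadoop} fs -test -e "$1/data"; then
        echo "OK $1/data"
    else
        echo "MISSING $1/data"
    fi
}

# Example use (commented out so that sourcing this file has no side effects):
# for sub in partial-vectors-0 tokenized-documents; do
#     check_data_file "/user/MY_USERNAME/projects/lda/mahout_vectors/$sub"
# done
```

This only reports the mismatch. It does not fix it, and renaming one part
file to "data" does not cover a directory with several part files, as in the
listing above.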

Also, as a *bonus problem*: if the input folder
/user/MY_USERNAME/projects/lda/twitter_sequence_files contains more than one
file (for example, if I run only the map phase without a final reducer), this
whole chain does not work.

Thanks,
Ovi

---
Ovidiu Dan - http://www.ovidiudan.com/

Please do not print this e-mail unless it is mandatory

My public key can be downloaded from subkeys.pgp.net, or
http://www.ovidiudan.com/public.pgp
