Hi again,

Is there any workaround for my problems? Or is there another way to transform many, many small messages (they're tweets) into Mahout vectors and then run LDA on them, without hitting these errors? Converting them to txt files would be a bit of a pain, because I would end up with millions of very small files. And a Lucene index would be overkill, I think.
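For context, this is roughly how I imagine packing the tweets into one SequenceFile instead of millions of txt files. A minimal, untested sketch; the class name, paths, and stand-in data are placeholders (SparseVectorsFromSequenceFiles reads (docId, docText) pairs of Text, as far as I can tell):

    // Sketch: fold many small tweets into a single SequenceFile of
    // (Text docId, Text docText) pairs for SparseVectorsFromSequenceFiles.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class TweetsToSequenceFile {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder output path.
        Path out = new Path("projects/lda/twitter_sequence_files/tweets.seq");

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
          // Stand-in data; iterate over the real tweet source here.
          String[][] tweets = {
              {"/tweet/1", "first tweet text"},
              {"/tweet/2", "second tweet text"},
          };
          for (String[] t : tweets) {
            writer.append(new Text(t[0]), new Text(t[1]));
          }
        } finally {
          writer.close();
        }
      }
    }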
Thanks,
Ovi

---
Ovidiu Dan - http://www.ovidiudan.com/

Please do not print this e-mail unless it is mandatory

My public key can be downloaded from subkeys.pgp.net, or
http://www.ovidiudan.com/public.pgp


On Fri, Feb 12, 2010 at 3:51 AM, Robin Anil <[email protected]> wrote:

> This was meant for the dev list. I am looking into the first error.
>
> -bcc mahout-user
>
>
> ---------- Forwarded message ----------
> From: Robin Anil <[email protected]>
> Date: Fri, Feb 12, 2010 at 2:20 PM
> Subject: Re: Problem converting SequenceFile to vectors, then running LDA
> To: [email protected]
>
>
> Hi,
>
> This confusion arises from the fact that we use intermediate folders as
> subfolders under the output folder. How about we standardize on all jobs
> taking input, intermediate, and output folders? If not now, then for the
> next release?
>
> Robin
>
>
> On Fri, Feb 12, 2010 at 10:46 AM, Ovidiu Dan <[email protected]> wrote:
>
> > Hello Mahout developers / users,
> >
> > I am trying to convert a properly formatted SequenceFile to Mahout
> > vectors so I can run LDA on them. For reference I am using these two
> > documents:
> > http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html
> > http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
> >
> > I got the Mahout code from SVN on February 11th, 2010. Below are the
> > steps I took and the problems I encountered:
> >
> > export HADOOP_HOME=/home/hadoop/hadoop/hadoop_install/
> > export MAHOUT_HOME=/home/hadoop/hadoop/hadoop_install/bin/ovi/lda/trunk/
> >
> > $HADOOP_HOME/bin/hadoop jar \
> >   $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
> >   org.apache.mahout.text.SparseVectorsFromSequenceFiles \
> >   -i /user/MY_USERNAME/projects/lda/twitter_sequence_files/ \
> >   -o /user/MY_USERNAME/projects/lda/mahout_vectors/ \
> >   -wt tf -chunk 300 \
> >   -a org.apache.lucene.analysis.standard.StandardAnalyzer \
> >   --minSupport 2 --minDF 1 --maxDFPercent 50 --norm 2
> >
> > *Problem #1:* I got this error at the end, but I think everything
> > finished more or less correctly:
> >
> > Exception in thread "main" java.lang.NoSuchMethodError:
> > org.apache.mahout.common.HadoopUtil.deletePath(Ljava/lang/String;Lorg/apache/hadoop/fs/FileSystem;)V
> >     at org.apache.mahout.utils.vectors.text.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:173)
> >     at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:254)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >
> > $HADOOP_HOME/bin/hadoop jar \
> >   $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
> >   org.apache.mahout.clustering.lda.LDADriver \
> >   -i /user/MY_USERNAME/projects/lda/mahout_vectors/ \
> >   -o /user/MY_USERNAME/projects/lda/lda_out/ \
> >   -k 20 --numWords 100000 --numReducers 33
> >
> > *Problem #2:*
> >
> > Exception in thread "main" java.io.FileNotFoundException: File does not
> > exist: hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >
> > *Tried to fix:*
> >
> > ../../hadoop fs -mv \
> >   /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/part-00000 \
> >   /user/MY_USERNAME/projects/lda/mahout_vectors/partial-vectors-0/data
> >
> > *Ran again:*
> >
> > $HADOOP_HOME/bin/hadoop jar \
> >   $MAHOUT_HOME/examples/target/mahout-examples-0.3-SNAPSHOT.job \
> >   org.apache.mahout.clustering.lda.LDADriver \
> >   -i /user/MY_USERNAME/projects/lda/mahout_vectors/ \
> >   -o /user/MY_USERNAME/projects/lda/lda_out/ \
> >   -k 20 --numWords 100000 --numReducers 33
> >
> > *Problem #3:*
> >
> > Exception in thread "main" java.io.FileNotFoundException: File does not
> > exist: hdfs://SOME_SERVER:8003/user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/data
> >
> > [had...@some_server retweets]$ ../../hadoop fs -ls \
> >   /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/
> > Found 3 items
> > -rw-r--r--   3 hadoop supergroup  129721338 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00000
> > -rw-r--r--   3 hadoop supergroup  128256085 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00001
> > -rw-r--r--   3 hadoop supergroup   24160265 2010-02-11 23:54 /user/MY_USERNAME/projects/lda/mahout_vectors/tokenized-documents/part-00002
> >
> > Also, as a *bonus problem*, if the input folder
> > /user/MY_USERNAME/projects/lda/twitter_sequence_files contains more than
> > one file (for example, if I run only the maps without a final reducer),
> > this whole chain doesn't work.
> >
> > Thanks,
> > Ovi
> >
> > ---
> > Ovidiu Dan - http://www.ovidiudan.com/
> >
> > Please do not print this e-mail unless it is mandatory
> >
> > My public key can be downloaded from subkeys.pgp.net, or
> > http://www.ovidiudan.com/public.pgp
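P.S. Re: problems #2 and #3 above: Robin's note suggests the vectorizer keeps its intermediates (tokenized-documents, partial-vectors-*) as subfolders under the same output folder, so pointing LDADriver's -i at the parent folder apparently makes it trip over those intermediates. Rather than renaming more files, I am now peeking into the subfolders to find the one that actually holds (Text, VectorWritable) pairs and pointing -i there. A rough, untested sketch; the VectorWritable value class and the example part-file path are my assumptions about what this trunk build writes:

    // Sketch: print the first few records of a vector SequenceFile to
    // confirm it holds (Text, VectorWritable) pairs before handing the
    // folder to LDADriver's -i option.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.VectorWritable;

    public class PeekVectors {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // e.g. /user/MY_USERNAME/projects/lda/mahout_vectors/<subfolder>/part-00000
        Path part = new Path(args[0]);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        try {
          Text key = new Text();
          VectorWritable value = new VectorWritable(); // assumed value class
          for (int i = 0; i < 5 && reader.next(key, value); i++) {
            System.out.println(key + " => "
                + value.get().getNumNondefaultElements() + " terms");
          }
        } finally {
          reader.close();
        }
      }
    }

If the first few records print document ids with term counts, that subfolder should be the one to hand to LDADriver.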

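P.P.S. For the bonus problem (more than one file in the input folder), a possible workaround is to fold the per-map part files into a single SequenceFile before vectorizing. Another rough, untested sketch, assuming the maps wrote (Text, Text) records; adjust the key/value classes to whatever the job actually emits:

    // Sketch: merge several part-NNNNN SequenceFiles into one file,
    // assuming (Text, Text) records throughout.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class MergeSequenceFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]); // folder of part-00000, part-00001, ...
        Path merged = new Path(args[1]);   // single output SequenceFile

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, merged, Text.class, Text.class);
        try {
          Text key = new Text();
          Text value = new Text();
          for (FileStatus status : fs.listStatus(inputDir)) {
            if (!status.getPath().getName().startsWith("part-")) {
              continue; // skip _logs and other non-data entries
            }
            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, status.getPath(), conf);
            try {
              while (reader.next(key, value)) {
                writer.append(key, value);
              }
            } finally {
              reader.close();
            }
          }
        } finally {
          writer.close();
        }
      }
    }

Run it with the map output folder and a target file, then point SparseVectorsFromSequenceFiles at the folder containing just the merged file.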