In Step 3, you are generating tf vectors, but Step 4 expects tf-idf vectors.
Change the weight in Step 3 to tfidf (which is the default, by the way, if none is specified).

On Monday, January 27, 2014 1:44 PM, Ted Dunning <[email protected]> wrote:

I am forwarding this to the list for Peyman.

-----------------------------------------------------------------

I am trying to run the CVB (Mahout 0.8) on a directory of plain text files, following the procedure outlined below. However, I am not able to see the vectordump (step 6). Run without the "-c csv" flag, the generated file is empty. However, if I use the "-c csv" flag, the generated file starts with a series of numbers followed by an alphabetically organized series of unigrams (see below):

#1,10,1163,12,121,13,14,141,1462,15,16,17,185,1901,197,2,201,2227,23,283,298,3,331,35,4,402,4351,445,5,57,58,6,68,7,9,987,a.m,ab,abc,abercrombie,abercrombies,ability

Can someone point out what I am doing wrong?

Thank you

0: Set paths

> export HDFS_PATH=/path/to/hdfs/
> export LOCAL_PATH=/path/to/localfs

1: Put docs in HDFS using hadoop fs -put [-put <localsrc> ... <dst>]

> hadoop fs -put $LOCAL_PATH/test $HDFS_PATH/rawdata

2: Generate sequence files (of Text) from a directory

> mahout seqdirectory \
    -i $HDFS_PATH/rawdata \
    -o $HDFS_PATH/sequenced \
    -c UTF-8 -chunk 5

3: Generate sparse Vector from Text sequence files

> mahout seq2sparse \
    -i $HDFS_PATH/sequenced \
    -o $HDFS_PATH/sparseVectors \
    -ow --maxDFPercent 85 --namedVector --weight tf

4: rowid: Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}

> mahout rowid \
    -i $HDFS_PATH/sparseVectors/tfidf-vectors \
    -o $HDFS_PATH/matrix

5: Run cvb

> mahout cvb \
    -i $HDFS_PATH/matrix/matrix \
    -o $HDFS_PATH/test-lda \
    -k 100 -ow -x 40 \
    -dict $HDFS_PATH/sparseVectors/dictionary.file-0 \
    -dt $HDFS_PATH/test-lda-topics \
    -mt $HDFS_PATH/test-lda-model

6: Dump vectors from a sequence file to text

> mahout vectordump \
    -i $HDFS_PATH/test-lda-topics/part-m-00000 \
    -o $LOCAL_PATH/vectordump \
    -vs 10 -p true \
    -d $HDFS_PATH/sparseVectors/dictionary.file-0 \
    -dt sequencefile \
    -sort $HDFS_PATH/test-lda-topics/part-m-00000 \
    -c csv ; cat $LOCAL_PATH/vectordump
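
For reference, the mismatch between Steps 3 and 4 can be avoided by deriving both the weight and the rowid input directory from one variable. The following is only a sketch under a few assumptions: the flags are copied unchanged from the commands in this thread, the path is the placeholder from Step 0, and seq2sparse is assumed to name its output directory after the weight (tf-vectors or tfidf-vectors), as the paths above suggest. The `run` helper and the `DRY_RUN` switch are hypothetical names introduced for this sketch; by default the commands are printed rather than executed.

```shell
# Placeholder path from Step 0 of the quoted message.
HDFS_PATH=/path/to/hdfs

# One variable drives both the seq2sparse weight and the rowid input,
# so Step 4 always reads the directory that Step 3 wrote.
WEIGHT=tfidf
VECTORS="$HDFS_PATH/sparseVectors/${WEIGHT}-vectors"

# Hypothetical helper for this sketch: print the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Step 3: same flags as in the original message; only the weight changes.
run mahout seq2sparse \
  -i "$HDFS_PATH/sequenced" \
  -o "$HDFS_PATH/sparseVectors" \
  -ow --maxDFPercent 85 --namedVector --weight "$WEIGHT"

# Step 4: point rowid at the vectors generated in Step 3.
run mahout rowid \
  -i "$VECTORS" \
  -o "$HDFS_PATH/matrix"
```

Setting `WEIGHT=tf` instead would keep the original weighting and point rowid at the tf-vectors directory; either way, the two steps stay consistent. Running with `DRY_RUN=0` executes the commands instead of printing them.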
