Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Quick tour of Mahout text processing from the command line 
(https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+Mahout+text+processing+from+the+command+line)

Added by Pat Ferrel:
---------------------------------------------------------------------
h1. *Quick tour of Mahout text processing from the command line*

This is a concise tour of using the Mahout command line to generate text analysis data. It follows examples from the [Mahout in Action|http://manning.com/owen/] book and uses the Reuters-21578 data set. It traces one simple path through vectorizing text, clustering the documents, and calculating similar documents. The examples will work locally or distributed on a Hadoop cluster; with the small data set provided, a local installation is probably fast enough.

h1. *Generate Mahout sequence files from text*

Get the [Reuters-21578|http://www.daviddlewis.com/resources/testcollections/reuters21578/] files and extract them into “./reuters”. They are in SGML format.
# First convert from SGML to plain text:
mvn -e -q exec:java \
   -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
   -Dexec.args="reuters/ reuters-extracted/"
If you plan to run this example on a Hadoop cluster you will need to copy the files to HDFS, which is not covered here.
# Now turn the directory of raw text into Mahout sequence files:
mahout seqdirectory \
   -c UTF-8 \
   -i examples/reuters-extracted/ \
   -o reuters-seqfiles
# Examine the sequence files with seqdumper:
mahout seqdumper -s reuters-seqfiles/chunk-0 | more
You should see something like this:

Input Path: reuters-seqfiles/chunk-0

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text

Key: /-tmp/reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79

BAHIA COCOA REVIEW

Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal ...
# Create tf-idf vectors:
mahout seq2sparse \
   -i reuters-seqfiles/ \
   -o reuters-vectors/ \
   -ow -chunk 100 \
   -x 90 \
   -seq \
   -a com.finderbots.analyzers.LuceneStemmingAnalyzer \
   -ml 50 \
   -n 2 \
   -nv
This uses a custom Lucene analyzer that chains several token filters to stem tokens and discard numbers, stop words (from a list), and short words. -n 2 (the L2 norm) is best for cosine distance, which we are using for clustering and similarity. -x 90 means that a token appearing in more than 90% of the docs is treated as a stop word. -ml 50 sets the minimum log-likelihood ratio, which only matters when generating n-grams.
Note: get named vectors (-nv) or it will be difficult to map docs to clusters later.
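The effect of the weighting and of -n 2 can be sketched in a few lines of Python. This is an illustration only, not Mahout code; the term frequencies and document frequencies are invented. The point is that tf-idf weighting followed by L2 normalization yields unit-length vectors, so dot products between them are exactly cosine similarities.

```python
import math

def tfidf(tf, df, num_docs):
    # Classic tf-idf weight: term frequency scaled by inverse document frequency.
    return tf * math.log(num_docs / df)

def l2_normalize(vec):
    # -n 2 applies the L2 (Euclidean) norm, giving every document unit length,
    # so the dot product of two normalized vectors equals their cosine similarity.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()}

# Hypothetical 3-document collection; term -> (tf in this doc, df across docs).
doc = {"cocoa": (4, 1), "review": (1, 2)}
weighted = {t: tfidf(tf, df, num_docs=3) for t, (tf, df) in doc.items()}
unit = l2_normalize(weighted)
print(sum(w * w for w in unit.values()))  # ~1.0: unit length
```

A rare term with a high count ("cocoa" here) ends up dominating the vector, which is the behavior tf-idf is meant to produce.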
# Examine the vectors if you like, but they are not really human readable:
mahout seqdumper -s reuters-vectors/tfidf-vectors/part-r-00000
# Examine the tokenized docs to make sure the custom analyzer did its job:
mahout seqdumper \
   -s reuters-vectors/tokenized-documents/part-m-00000
This should show each doc as clean tokenized text: stemmed, with no numbers, etc.
# Make sure to look at the dictionary. It maps every token to the integer id that references it. The vectors store the integer ids, not the tokens, so a lookup is required to see what is really inside a vector:
mahout seqdumper \
   -s reuters-vectors/dictionary.file-0 \
   | more
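The lookup the dictionary enables can be sketched in plain Python (the token ids and weights below are made up for illustration): invert the token-to-id mapping, then decode a sparse vector's ids back into tokens.

```python
# The dictionary maps each token to an integer id; vectors store only the ids.
# To read a vector, invert the mapping. Ids and weights here are invented.
dictionary = {"cocoa": 372, "bahia": 966, "shower": 3027}
id_to_token = {i: t for t, i in dictionary.items()}

# A sparse tf-idf vector as seqdumper would show it: id -> weight.
sparse_vector = {372: 0.318, 966: 0.396, 3027: 0.230}
decoded = {id_to_token[i]: w for i, w in sparse_vector.items()}
print(decoded)  # {'cocoa': 0.318, 'bahia': 0.396, 'shower': 0.23}
```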

h1. *Cluster the documents using k-means*

# Calculate the clusters and assign documents to them:
mahout kmeans \
   -i reuters-vectors/tfidf-vectors/ \
   -c reuters-kmeans-centroids \
   -cl \
   -o reuters-kmeans-clusters \
   -k 20 \
   -ow \
   -x 10 \
   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
This calculates the cluster centroids and puts them in the output dir, then finds which vectors belong to each final cluster and puts that mapping in output/clusteredPoints. If you leave out -cl you will not get the mapping of doc to cluster.
Note: fuzzy k-means is pretty sensitive to the fuzziness factor and can produce meaningless clusters, so look at the -m parameter to fkmeans before trying it; m = 2 produced garbage results.
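The assignment pass that -cl triggers can be sketched in Python. This is a toy illustration of the idea, not Mahout's implementation; the vectors and centroids are invented. Mahout's CosineDistanceMeasure returns 1 minus the cosine similarity, and each document goes to its nearest centroid under that distance.

```python
import math

def cosine_distance(a, b):
    # Cosine distance as 1 - cosine similarity between two sparse vectors.
    dot = sum(a.get(k, 0.0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb)

def assign(doc, centroids):
    # The assignment step: each document goes to its nearest centroid.
    return min(centroids, key=lambda c: cosine_distance(doc, centroids[c]))

# Two invented centroids and one invented doc vector (term id -> weight).
centroids = {"c0": {1: 1.0, 2: 0.2}, "c1": {3: 1.0}}
doc = {1: 0.9, 2: 0.1}
print(assign(doc, centroids))  # c0
```

The doc shares terms 1 and 2 with centroid c0 and nothing with c1, so it lands in c0; with hard k-means that assignment carries weight 1.0, matching the wt values shown below.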
# Examine the clusters and perhaps even do some analysis of how good they are:
mahout clusterdump \
   -d reuters-vectors/dictionary.file-0 \
   -dt sequencefile \
   -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
   -n 20 \
   -b 100 \
   -p reuters-kmeans-clusters/clusteredPoints/
# The clusteredPoints dir has the docs mapped into clusters; if you created vectors with names (seq2sparse -nv) you'll see the names, along with each doc's distance from its centroid under the distance measure supplied to the clustering driver. To look at this, use seqdumper:
mahout seqdumper \
   -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 \
   | more

You will see that each record in the file contains:

   key: cluster id; value: wt (the likelihood that the vector belongs to the cluster), the distance from the centroid, the named vector belonging to the cluster, and the vector data.

For k-means the likelihood will be 1.0 or 0. For example:

   Key: 21477: Value: wt: 1.0 distance: 0.9420744909793364  vec: /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396, 3027:0.230, 8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270, 14371:0.413]
Clusters, of course, do not come with names. A simple solution is to construct a name from the top terms in the centroid output from clusterdump.
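One possible naming scheme can be sketched in Python, using the same top-terms idea: take the highest-weighted entries of a centroid, look them up in the dictionary, and join them. The terms and weights here are invented for illustration.

```python
def name_cluster(centroid, id_to_token, top=3):
    # Label a cluster with its highest-weighted centroid terms,
    # the same terms clusterdump prints as "Top Terms".
    top_ids = sorted(centroid, key=centroid.get, reverse=True)[:top]
    return "-".join(id_to_token[i] for i in top_ids)

# Invented dictionary fragment and centroid (term id -> weight).
id_to_token = {372: "cocoa", 966: "bahia", 3027: "shower", 8816: "temporao"}
centroid = {372: 0.31, 966: 0.44, 3027: 0.12, 8816: 0.40}
print(name_cluster(centroid, id_to_token))  # bahia-temporao-cocoa
```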

h1. *Calculate several similar docs for each doc in the data*

This takes every doc in the data set and, for each one, calculates the 10 most similar docs. It is like a “find more like this” search, but calculated in the background. It is fast and requires only three mapreduce passes.
# First create a matrix from the vectors:
mahout rowid \
   -i reuters-vectors/tfidf-vectors/part-r-00000 \
   -o reuters-matrix
You'll get output announcing the number of rows and columns/dimensions of the doc collection stored in the matrix. It looks like this:
Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix
Save the number of columns, since it is needed in the next step. Also note that this creates a reuters-matrix/docIndex file where the rowids are mapped to docids. In this example the mapping will be rowid --> file name, since we created named vectors in seq2sparse.
Note: this does not create a Mahout Matrix class but a sequence file, so use seqdumper to examine the results.
# Create a collection of similar docs for each row of the matrix above:
mahout rowsimilarity \
   -i reuters-matrix/matrix \
   -o reuters-similarity \
   -r 19515 \
   --similarityClassname SIMILARITY_COSINE \
   -m 10 \
   -ess
This generates the 10 most similar docs for each doc in the collection.
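Conceptually, rowsimilarity computes, for each row, the other rows with the highest cosine similarity. A toy Python sketch of that idea (not the distributed implementation; the rows below are invented):

```python
import math

def cosine(a, b):
    # Cosine similarity between two sparse vectors (id -> weight dicts).
    dot = sum(a.get(k, 0.0) * v for k, v in b.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def most_similar(rows, row_id, n=10):
    # What rowsimilarity produces per row: the n other rows
    # with the highest cosine similarity, best first.
    scores = {r: cosine(rows[row_id], v) for r, v in rows.items() if r != row_id}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Three invented rows; rows 0 and 1 share terms, row 2 does not.
rows = {0: {1: 1.0, 2: 0.5}, 1: {1: 0.9, 2: 0.6}, 2: {3: 1.0}}
print(most_similar(rows, 0, n=2))  # [1, 2]
```

Row 1 shares both terms with row 0 and ranks first; row 2 shares none, so its cosine is 0 and it ranks last.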
# Examine the similarity list:
mahout seqdumper -s reuters-similarity/part-r-00000 | more
It should look something like this:
   Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
     12793:0.22009858979452146,3275:0.1871791030103281,
     14613:0.3534278632679437,4411:0.2516380602790199,
     17520:0.3139731583634198,13611:0.18968888212315968,
     14354:0.17673965754661425,0:1.0000000000000004}
For each rowid there is a list of ten rowids and similarity values. These correspond to documents and the similarities produced by the --similarityClassname; in this case they are cosines of the angle between a doc and a similar doc. Look in reuters-matrix/docIndex to find the rowid-to-docid mapping. It should look something like this:
   Key: 0: Value: /-tmp/reut2-000.sgm-0.txt
   Key: 1: Value: /-tmp/reut2-000.sgm-1.txt
   Key: 2: Value: /-tmp/reut2-000.sgm-10.txt
   Key: 3: Value: /-tmp/reut2-000.sgm-100.txt
