Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Quick tour of text analysis using the Mahout command line
(https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line)
Added by Pat Ferrel:
---------------------------------------------------------------------
h2. Introduction
This is a concise quick tour of using the Mahout command line to generate text analysis data. It follows examples from the [Mahout in Action|http://manning.com/owen/] book and uses the Reuters-21578 data set. It traces one simple path through vectorizing text, creating clusters, and calculating similar documents. The examples will work locally or distributed on a Hadoop cluster; with the small data set provided, a local installation is probably fast enough.
h2. Generate Mahout vectors from text
Get the [Reuters-21578|http://www.daviddlewis.com/resources/testcollections/reuters21578/] files and extract them into "./reuters". They are in SGML format. Mahout can also create sequence files from raw text and other formats. At the end of this section you will have turned the text files into vectors, which are basically lists of weighted tokens. The weights are calculated to indicate the importance of each token.
# Convert from SGML to text:
mvn \-e \-q exec:java \-Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \-Dexec.args="reuters/ reuters-extracted/"
If you plan to run this example on a Hadoop cluster you will need to copy the files to HDFS, which is not covered here.
# Now turn the raw text in a directory into Mahout sequence files:
mahout seqdirectory \
 -c UTF-8 \
 -i reuters-extracted/ \
 -o reuters-seqfiles
# Examine the sequence files with seqdumper:
mahout seqdumper \-s reuters-seqfiles/chunk-0 \| more
You should see something like this:
Input Path: reuters-seqfiles/chunk-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
Key: /-tmp/reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79
BAHIA COCOA REVIEW
Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal …
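Conceptually, seqdirectory just records one key/value pair per input file: the file's relative path and its text. A minimal Python stand-in for the record layout (not Mahout code, just a sketch):

```python
import os

def seqdirectory(input_dir):
    """Sketch of the records seqdirectory writes: (relative path, file text)."""
    records = {}
    for root, _, files in os.walk(input_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as f:
                # Keys look like "/reut2-000.sgm-0.txt" in the real output.
                records["/" + os.path.relpath(path, input_dir)] = f.read()
    return records
```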
# Create TFIDF weighted vectors:
mahout seq2sparse \
 -i reuters-seqfiles/ \
 -o reuters-vectors/ \
 -ow \-chunk 100 \
 -x 90 \
 -seq \
 -ml 50 \
 -n 2 \
 -nv
This uses the default analyzer and default TFIDF weighting. \-n 2 applies L2 normalization, which is good for the cosine distance measure we use in clustering and for similarity; \-x 90 means that a token appearing in more than 90% of the docs is treated as a stop word; \-nv produces named vectors, which make further data files easier to inspect.
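To make the \-n 2 option concrete: L2 normalization scales each vector to unit length, so dot products between vectors are exactly their cosine similarities. A sketch of the idea in Python, using a toy corpus and the classic tf * log(N/df) formula (Mahout's exact weighting differs in details):

```python
import math

# Toy corpus standing in for the analyzer's token output (not the Reuters data).
docs = [
    ["cocoa", "showers", "cocoa"],
    ["cocoa", "drought"],
    ["showers", "drought", "drought"],
]

n_docs = len(docs)
df = {}  # document frequency of each token
for doc in docs:
    for tok in set(doc):
        df[tok] = df.get(tok, 0) + 1

def tfidf_vector(doc):
    """TF-IDF weights for one doc, L2-normalized -- the effect of -n 2."""
    tf = {}
    for tok in doc:
        tf[tok] = tf.get(tok, 0) + 1
    # Classic tf * log(N/df) weighting.
    w = {tok: c * math.log(n_docs / df[tok]) for tok, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {tok: v / norm for tok, v in w.items()}

vec = tfidf_vector(docs[0])
```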
# Examine the vectors if you like, but they are not really human readable:
mahout seqdumper \-s reuters-vectors/tfidf-vectors/part-r-00000
# Examine the tokenized docs to make sure the analyzer is filtering out enough (note that the rest of this example uses a more restrictive Lucene analyzer rather than the default, so your results may vary):
mahout seqdumper \
 -s reuters-vectors/tokenized-documents/part-m-00000
This should show each doc with nice clean tokenized text.
# Examine the dictionary. It maps token text to token id:
mahout seqdumper \
 -s reuters-vectors/dictionary.file-0 \
 \| more
h2. Cluster documents using kmeans
# Calculate clusters and assign documents to them:
mahout kmeans \
 -i reuters-vectors/tfidf-vectors/ \
 -c reuters-kmeans-centroids \
 -cl \
 -o reuters-kmeans-clusters \
 -k 20 \
 -ow \
 -x 10 \
 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
This calculates the cluster centroids and puts them in the output dir; it then finds which vectors belong to the final clusters and puts that mapping in clusteredPoints under the output dir. If you leave out \-cl you will not get the mapping of doc to cluster.
You should see the following files created:
host$ ls reuters-kmeans-clusters/
clusteredPoints clusters-1 clusters-2 clusters-3-final
Here clusters-3-final has the final cluster centroids and clusteredPoints has the docs assigned to each cluster.
Note: The fkmeans driver, which implements fuzzy kmeans, is pretty sensitive to the fuzziness measure, so look at the \-m parameter to fkmeans before trying it. \-m 2 produced poor results.
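The effect of the CosineDistanceMeasure on the assignment step can be sketched as follows; the centroids and doc vector below are made-up toy data, and Mahout's actual implementation is distributed rather than a simple loop:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between sparse vectors: 1 - cos(angle between them)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb)

# Hypothetical centroids and one doc vector (token id -> tfidf weight).
centroids = {
    0: {1: 0.9, 2: 0.1},
    1: {3: 0.8, 4: 0.6},
}
doc = {1: 0.7, 2: 0.3}

# The assignment step: each doc goes to the centroid at minimum distance.
best = min(centroids, key=lambda c: cosine_distance(doc, centroids[c]))
```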
# Examine the clusters and perhaps even do some analysis of how good the clusters are:
mahout clusterdump \
 -d reuters-vectors/dictionary.file-0 \
 -dt sequencefile \
 -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
 -n 20 \
 -b 100 \
 -p reuters-kmeans-clusters/clusteredPoints/
Note: clusterdump can do some analysis of the quality of clusters, but that is not shown here.
# The clusteredPoints dir has the docs mapped into clusters, and if you created vectors with names (seq2sparse \-nv) you'll see file names. You also get the distance from the centroid, computed with the distance measure supplied to the clustering driver. To look at this use seqdumper:
mahout seqdumper \
 -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 \
 \| more
You will see that the file contains key: clusterid; value: wt = likelihood that the vector is in the cluster, the distance from the centroid, the named vector belonging to the cluster, and the vector data.
For kmeans the likelihood will be 1.0 or 0. For example:
Key: 21477: Value: wt: 1.0 distance: 0.9420744909793364 vec: /-tmp/reut2-000.sgm-158.txt = \[372:0.318, 966:0.396, 3027:0.230, 8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270, 14371:0.413\]
Clusters, of course, do not have names. A simple solution is to construct a name from the top terms in the centroid as they are output by clusterdump.
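A sketch of that naming trick, assuming you have already parsed the top terms and their centroid weights out of clusterdump's output (the cluster id and terms below are hypothetical):

```python
# Hypothetical parsed clusterdump output: cluster id -> (term, centroid weight).
top_terms = {
    "VL-21477": [("cocoa", 0.45), ("buffer", 0.31), ("icco", 0.27), ("delegates", 0.22)],
}

def name_cluster(terms, k=3):
    """Label a cluster with its k highest-weighted centroid terms."""
    ranked = sorted(terms, key=lambda t: t[1], reverse=True)
    return "-".join(tok for tok, _ in ranked[:k])

labels = {cid: name_cluster(terms) for cid, terms in top_terms.items()}
```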
h2. Calculate several similar docs to each doc in the data
This will take all docs in the data set and, for each, calculate the 10 most similar docs. This is a "find more like this" type search, but calculated in the background. It is surprisingly fast and requires only three mapreduce passes.
# First create a matrix from the vectors:
mahout rowid \
 -i reuters-vectors/tfidf-vectors/part-r-00000 \
 -o reuters-matrix
You'll get output announcing the number of rows (documents) and columns (total number of tokens in the dictionary) in the matrix. It will look like this:
Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix
Save the number of columns since it is needed in the next step. Also note that this creates a reuters-matrix/docIndex file where the rowids are mapped to docids. In this example that is a rowid-->file name mapping, since we created named vectors in seq2sparse.
Note: This does not create a Mahout Matrix class but a sequence file, so use seqdumper to examine the results.
# Create a collection of similar docs for each row of the matrix above:
mahout rowsimilarity \
 -i reuters-matrix/matrix \
 -o reuters-named-similarity \
 -r 19515 \
 --similarityClassname SIMILARITY_COSINE \
 -m 10 \
 -ess
This will generate the 10 most similar docs for each doc in the collection.
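For a small collection you can sketch what rowsimilarity computes with a brute-force all-pairs loop; the toy vectors below are made up, and the real job does this in map-reduce passes while keeping only the \-m best matches per row:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two sparse vectors (token id -> weight)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Tiny stand-in for the matrix: rowid -> sparse tfidf vector.
rows = {
    0: {1: 0.8, 2: 0.6},
    1: {1: 0.6, 2: 0.8},
    2: {3: 1.0},
}

def most_similar(rowid, m=10):
    """Brute-force version of rowsimilarity's output for one row."""
    sims = [(other, cosine_sim(rows[rowid], rows[other]))
            for other in rows if other != rowid]
    sims.sort(key=lambda t: t[1], reverse=True)
    return sims[:m]
```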
# Examine the similarity list:
mahout seqdumper \-s reuters-named-similarity/part-r-00000 \| more
It should look something like this:
Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095, 12793:0.22009858979452146,3275:0.1871791030103281, 14613:0.3534278632679437,4411:0.2516380602790199, 17520:0.3139731583634198,13611:0.18968888212315968, 14354:0.17673965754661425,0:1.0000000000000004}
For each rowid there is a list of ten rowids and similarity values. These correspond to documents and similarities computed by the \--similarityClassname; in this case they are cosines of the angle between the doc and each similar doc. Look in reuters-matrix/docIndex to find the rowid to docid mapping. It should look something like this:
Key: 0: Value: /-tmp/reut2-000.sgm-0.txt
Key: 1: Value: /-tmp/reut2-000.sgm-1.txt
Key: 2: Value: /-tmp/reut2-000.sgm-10.txt
Key: 3: Value: /-tmp/reut2-000.sgm-100.txt