Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Quick tour of text analysis using the Mahout command line 
(https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line)


Edited by Pat Ferrel:
---------------------------------------------------------------------
h2. Introduction

This is a concise quick tour of using the Mahout command line to generate text-analysis data. It follows examples from the [Mahout in Action|http://manning.com/owen/] book and uses the Reuters-21578 data set. It shows one simple path through vectorizing text, clustering documents, and calculating similar documents. The examples work locally or distributed on a Hadoop cluster; with the small data set provided, a local installation is probably fast enough.


h2. Generate Mahout vectors from text

Get the [Reuters-21578|http://www.daviddlewis.com/resources/testcollections/reuters21578/] files and extract them into "./reuters". They are in SGML format; Mahout can also create sequence files from raw text and other formats. At the end of this section you will have turned the text files into vectors, which are essentially lists of weighted tokens, where the weights indicate the importance of each token.
# Convert from SGML to text:
{code}
mvn -e -q exec:java \
   -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
   -Dexec.args="reuters/ reuters-extracted/"
{code}
If you plan to run this example on a Hadoop cluster you will need to copy the files to HDFS, which is not covered here.
# Now turn the raw text in a directory into Mahout sequence files:
{code}
mahout seqdirectory \
   -c UTF-8 \
   -i reuters-extracted/ \
   -o reuters-seqfiles
{code}
# Examine the sequence files with seqdumper:
{code}
mahout seqdumper -s reuters-seqfiles/chunk-0 | more
{code}
You should see something like this:
{code}
Input Path: reuters-seqfiles/chunk-0
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.Text
Key: /-tmp/reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79

BAHIA COCOA REVIEW

Showers continued throughout the week in the Bahia cocoa zone, alleviating the 
drought since early January and improving prospects for the coming temporao, 
although normal …
{code}

# Create TF-IDF weighted vectors:
{code}
mahout seq2sparse \
   -i reuters-seqfiles/ \
   -o reuters-vectors/ \
   -ow -chunk 100 \
   -x 90 \
   -seq \
   -ml 50 \
   -n 2 \
   -nv
{code}
This uses the default analyzer and default TF-IDF weighting. -n 2 applies L2 (Euclidean) normalization, which is a good choice for the cosine distance we use in clustering and for similarity; -x 90 means that a token appearing in more than 90% of the docs is treated as a stop word; -nv produces named vectors, which makes the later data files easier to inspect.
# Examine the vectors if you like, but they are not really human readable:
{code}
mahout seqdumper -s reuters-vectors/tfidf-vectors/part-r-00000
{code}
# Examine the tokenized docs to make sure the analyzer is filtering out enough (note that the rest of this example used a more restrictive Lucene analyzer rather than the default, so your results may vary):

{code}
mahout seqdumper \
   -s reuters-vectors/tokenized-documents/part-m-00000
{code}
This should show each doc with nice, clean tokenized text.
# Examine the dictionary. It maps token ids to token text.
{code}
mahout seqdumper \
   -s reuters-vectors/dictionary.file-0 \
   | more
{code}
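To make the vectorization step more concrete, here is an illustrative Python sketch (not Mahout's actual code) of what seq2sparse produces: a dictionary mapping tokens to ids, and L2-normalized TF-IDF vectors (the effect of -n 2). The function name and toy documents are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build L2-normalized TF-IDF vectors from tokenized docs, roughly
    the idea behind 'seq2sparse -n 2'. Each vector is a sparse
    {token_id: weight} dict; the dictionary maps token -> id, like
    the entries in dictionary.file-0."""
    # Dictionary: every distinct token gets an integer id.
    dictionary = {tok: i for i, tok in
                  enumerate(sorted({t for d in docs for t in d}))}
    n = len(docs)
    # Document frequency: in how many docs does each token appear?
    df = Counter(tok for d in docs for tok in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # TF-IDF weight: term frequency times inverse document frequency.
        vec = {dictionary[t]: tf[t] * math.log(n / df[t]) for t in tf}
        # L2 normalization so cosine comparisons are well behaved.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {i: w / norm for i, w in vec.items()}
        vectors.append(vec)
    return dictionary, vectors
```

Mahout's actual weighting (sublinear TF, IDF smoothing, min-df and max-df pruning) differs in detail, but the shape of the output — sparse id-to-weight vectors plus a dictionary to decode them — is the same.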

h2. Cluster documents using kmeans

# Run kmeans with cosine distance, letting it seed 20 random centroids (written to the -c path) and iterate at most 10 times:
{code}
mahout kmeans \
   -i reuters-vectors/tfidf-vectors/ \
   -c reuters-kmeans-centroids \
   -cl \
   -o reuters-kmeans-clusters \
   -k 20 \
   -ow \
   -x 10 \
   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
{code}
# Examine the clusters and perhaps even do some analysis of how good the clusters are:
{code}
mahout clusterdump \
   -d reuters-vectors/dictionary.file-0 \
   -dt sequencefile \
   -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
   -n 20 \
   -b 100 \
   -p reuters-kmeans-clusters/clusteredPoints/
{code}
*Note:* clusterdump can also do some analysis of the quality of clusters, not shown here.
# The clusteredPoints directory has the docs mapped into clusters. If you created vectors with names (seq2sparse -nv) you will see file names, along with the distance from the centroid computed using the distance measure supplied to the clustering driver. To look at this use seqdumper:
{code}
mahout seqdumper \
   -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 \
   | more
{code}
You will see that the file contains key: cluster id, and value: wt (the likelihood that the vector belongs to the cluster), the distance from the centroid, the named vector belonging to the cluster, and the vector data. For kmeans the likelihood will be 1.0 or 0. For example:
{code}
Key: 21477: Value: wt: 1.0 distance: 0.9420744909793364
vec: /-tmp/reut2-000.sgm-158.txt = [372:0.318,
966:0.396, 3027:0.230, 8816:0.452, 8868:0.308,
13639:0.278, 13648:0.264, 14334:0.270, 14371:0.413]
{code}
Clusters, of course, do not have names. A simple solution is to construct a name from the top terms in the centroid as they are output from clusterdump.
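The CosineDistanceMeasure passed to kmeans above, and the assignment step that kmeans repeats each iteration, can be sketched in Python (illustrative only; the function names are invented, and vectors are plain {token id: weight} dicts like those produced by seq2sparse):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity over sparse {token_id: weight} vectors,
    analogous to Mahout's CosineDistanceMeasure: 0 for parallel
    vectors, 1 for orthogonal ones."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def assign_to_clusters(vectors, centroids):
    """The assignment step kmeans repeats each iteration: every doc
    goes to the centroid with the smallest cosine distance, which is
    why 'wt' in clusteredPoints is 1.0 for exactly one cluster."""
    return [min(range(len(centroids)),
                key=lambda k: cosine_distance(v, centroids[k]))
            for v in vectors]
```

The real driver also recomputes each centroid as the mean of its assigned vectors and repeats until convergence or the -x iteration limit; only the assignment step is shown here.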

h2. Calculate several similar docs for each doc in the data

This takes all docs in the data set and, for each one, calculates the 10 most similar docs. It is a "find more like this" type of search, but calculated in the background. It is surprisingly fast and requires only three MapReduce passes.
# First create a matrix from the vectors:
{code}
mahout rowid \
   -i reuters-vectors/tfidf-vectors/part-r-00000 \
   -o reuters-matrix
{code}
You'll get output announcing the number of rows (documents) and columns (total number of tokens in the dictionary) in the matrix. It will look like this:
{code}
Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix
{code}
Save the number of columns, since it is needed in the next step. Also note that this creates a reuters-matrix/docIndex file where the rowids are mapped to docids. In this example the mapping will be rowid --> file name, since we created named vectors in seq2sparse.
Note: this does not create a Mahout Matrix class but a sequence file, so use seqdumper to examine the results.
# Create a collection of similar docs for each row of the matrix above:
{code}
mahout rowsimilarity \
   -i reuters-matrix/matrix \
   -o reuters-named-similarity \
   -r 19515 \
   --similarityClassname SIMILARITY_COSINE \
   -m 10 \
   -ess
{code}
This will generate the 10 most similar docs for each doc in the collection.

# Examine the similarity list:
{code}
mahout seqdumper -s reuters-named-similarity/part-r-00000 | more
{code}
It should look something like this:
{code}
Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095,
  12793:0.22009858979452146,3275:0.1871791030103281,
  14613:0.3534278632679437,4411:0.2516380602790199,
  17520:0.3139731583634198,13611:0.18968888212315968,
  14354:0.17673965754661425,0:1.0000000000000004}
{code}
For each rowid there is a list of ten rowids and similarity values. These correspond to the documents and the measure chosen with --similarityClassname; in this case they are cosines of the angle between a doc and a similar doc. Look in reuters-matrix/docIndex to find the rowid to docid mapping. It should look something like this:
{code}
Key: 0: Value: /-tmp/reut2-000.sgm-0.txt
Key: 1: Value: /-tmp/reut2-000.sgm-1.txt
Key: 2: Value: /-tmp/reut2-000.sgm-10.txt
Key: 3: Value: /-tmp/reut2-000.sgm-100.txt
{code}
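What rowsimilarity computes with SIMILARITY_COSINE can be sketched in Python (a naive O(n^2) illustration, not Mahout's distributed implementation; names are invented). Note that, as in the sample output above, each row's list includes the row itself with similarity of about 1.0:

```python
import math

def top_similar(rows, m=10):
    """For each row (a sparse {token_id: weight} dict), return the m
    most similar rows by cosine similarity -- a naive, single-machine
    version of what 'rowsimilarity -m 10' produces."""
    def cos(a, b):
        dot = sum(w * b.get(i, 0.0) for i, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    result = {}
    for r, row in enumerate(rows):
        # Score this row against every row (self included, as the
        # sample Mahout output above shows with its 0:1.0000... entry).
        scored = sorted(((cos(row, other), s) for s, other in enumerate(rows)),
                        reverse=True)[:m]
        result[r] = [(s, sim) for sim, s in scored]
    return result
```

Mahout avoids the all-pairs scan by exploiting sparsity across its three MapReduce passes (and prunes with -ess), but the result has the same shape: rowid mapped to a list of (rowid, similarity) pairs, decoded to file names via docIndex.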
