Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Creating Vectors from Text 
(https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text)


Edited by Grant Ingersoll:
---------------------------------------------------------------------
+*Mahout_0.2*+
{toc:style=disc|indent=20px}

h1. Introduction

To cluster documents, the raw text usually has to be converted into 
vectors that the clustering [Algorithms] can consume.  The approaches for 
doing so are described below.

h1. From Lucene

*NOTE: Your Lucene index must be created with the same version of Lucene used 
in Mahout.  Check Mahout's POM file for the version number.  Otherwise you 
will likely get an error such as "Exception in thread "main" 
org.apache.lucene.index.CorruptIndexException: Unknown format version: -11".*

Mahout has utilities that produce Mahout Vector representations from a Lucene 
index (or a Solr index, since Solr stores its data in a Lucene index).

For this, we assume you know how to build a Lucene/Solr index.  For those who 
don't, it is probably easiest to get up and running using 
[Solr|http://lucene.apache.org/solr] as it can ingest things like PDFs, XML, 
Office, etc. and create a Lucene index.  For those wanting to use just Lucene, 
see the Lucene [website|http://lucene.apache.org/java] or check out _Lucene In 
Action_ by Erik Hatcher, Otis Gospodnetic and Mike McCandless.

To get started, make sure you get a fresh copy of Mahout from 
[SVN|http://cwiki.apache.org/MAHOUT/buildingmahout.html] and are comfortable 
building it.  Mahout defines interfaces and implementations for efficiently 
iterating over a data source (currently only Lucene is supported, but the 
design should be extensible to databases, Solr, etc.) and produces a Mahout 
Vector file and a term dictionary, which can then be used for clustering.  The 
main code for driving this is the Driver program located in the 
org.apache.mahout.utils.vectors package.  The Driver program offers several 
input options, which can be displayed by specifying the --help option.  
Examples of running the Driver are included below:

h2. Generating an output file from a Lucene Index

{noformat}
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
   --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> \
   --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
   <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> \
   <--idField <Name of the idField in the Lucene index>>
{noformat}

h3. Create 50 Vectors from an Index 
{noformat}
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index \
    --field body --dictOut <PATH>/solr/wikipedia/dict.txt \
    --output <PATH>/solr/wikipedia/out.txt --max 50
{noformat}
This uses the index specified by --dir and the body field in it, writes the 
vectors to the file given by --output, and writes the dictionary to dict.txt.  
Only 50 vectors are output.  If you don't specify --max, all the documents in 
the index are output.

h3. Normalize 50 Vectors from a Lucene Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]
{noformat}
$MAHOUT_HOME/bin/mahout lucene.vector --dir <PATH>/wikipedia/solr/data/index \
    --field body --dictOut <PATH>/solr/wikipedia/dict.txt \
    --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2
{noformat}
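
To make the effect of --norm concrete, here is a minimal Java sketch of p-norm normalization on a plain double[] array.  It is not Mahout's code: the names NormSketch, norm and normalize are made up for illustration, and only p >= 1 and INF are handled, because this page does not say how --norm 0 is interpreted.

```java
import java.util.Arrays;

/** Illustrative only: scales a term-weight vector by its p-norm. */
final class NormSketch {

    /** Returns the p-norm of v; pass Double.POSITIVE_INFINITY for the INF (max) norm. */
    static double norm(double[] v, double p) {
        if (Double.isInfinite(p) && p > 0) {
            double max = 0.0;
            for (double x : v) {
                max = Math.max(max, Math.abs(x));
            }
            return max;
        }
        if (p < 1) {
            // Assumption: values below 1 are out of scope for this sketch.
            throw new IllegalArgumentException("this sketch only handles p >= 1 or INF");
        }
        double sum = 0.0;
        for (double x : v) {
            sum += Math.pow(Math.abs(x), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Returns a new array holding v divided by its p-norm; an all-zero vector stays all-zero. */
    static double[] normalize(double[] v, double p) {
        double n = norm(v, p);
        double[] out = new double[v.length];
        if (n == 0.0) {
            return out;
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / n;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] termWeights = {3.0, 4.0, 0.0};
        // L_2 norm, the analogue of --norm 2
        System.out.println(Arrays.toString(normalize(termWeights, 2)));
        // Max norm, the analogue of --norm INF
        System.out.println(Arrays.toString(normalize(termWeights, Double.POSITIVE_INFINITY)));
    }
}
```

With the L_2 norm every normalized vector has Euclidean length 1, so long and short documents become comparable by direction rather than by raw term counts.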

h1. From Directory of Text documents
Mahout has utilities to generate Vectors from a directory of text documents. 
Before creating the vectors, you need to convert the documents to SequenceFile 
format. SequenceFile is a Hadoop class that lets us write arbitrary 
key/value pairs into a file. The DocumentVectorizer requires the key to be a 
Text with a unique document id, and the value to be the Text content in UTF-8 
format.

You may find Tika (http://lucene.apache.org/tika) helpful in converting binary 
documents to text.

h2. Converting directory of documents to SequenceFile format
Mahout has a utility that reads a directory, including its sub-directories, 
and writes the documents into SequenceFiles in chunks. The document id it 
generates is <PREFIX><RELATIVE PATH FROM PARENT>/document.txt

From the examples directory, run:
{noformat}
$MAHOUT_HOME/bin/mahout seqdirectory \
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY> \
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}> \
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64> \
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>
{noformat}
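
The document id pattern described above (prefix, then the path relative to the parent directory, then the file name) can be sketched in plain Java.  This is only an illustration of the pattern, not the utility's code: DocIdSketch and documentId are hypothetical names, and joining the parts with forward slashes is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative only: builds an id of the form <prefix>/<relative path>/<file name>. */
final class DocIdSketch {

    /** Joins the prefix and each element of the path relative to parentDir with '/'. */
    static String documentId(String prefix, Path parentDir, Path file) {
        Path relative = parentDir.relativize(file);
        StringBuilder id = new StringBuilder(prefix);
        for (Path part : relative) {
            // Assumption: '/' is used as the separator regardless of platform.
            id.append('/').append(part.toString());
        }
        return id.toString();
    }

    public static void main(String[] args) {
        Path parent = Paths.get("/data/docs");
        Path file = Paths.get("/data/docs/sports/document.txt");
        System.out.println(documentId("wiki", parent, file));
    }
}
```

Because the id is derived from the relative path, two files with the same name in different sub-directories still get distinct keys.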

h2. Creating Vectors from SequenceFile

+*Mahout_0.3*+

Using the sequence file generated in the step above, run the following to 
generate vectors:
{noformat}
$MAHOUT_HOME/bin/mahout seq2sparse \
-i <PATH TO THE SEQUENCEFILES> \
-o <OUTPUT DIRECTORY WHERE VECTORS AND DICTIONARY ARE GENERATED> \
<-wt <WEIGHTING METHOD USED> {tf|tfidf}> \
<-chunk <MAX SIZE OF DICTIONARY CHUNK IN MB TO KEEP IN MEMORY> 100> \
<-a <NAME OF THE LUCENE ANALYZER TO TOKENIZE THE DOCUMENT> org.apache.lucene.analysis.standard.StandardAnalyzer> \
<--minSupport <MINIMUM SUPPORT> 2> \
<--minDF <MINIMUM DOCUMENT FREQUENCY> 1> \
<--maxDFPercent <MAX PERCENTAGE OF DOCS FOR DF. VALUE BETWEEN 0-100> 99> \
<--norm <REFER TO L_2 NORM ABOVE> {INF|integer >= 0}> \
<-seq <Create SequentialAccessVectors> {false|true required for running some algorithms (LDA, Lanczos)}>
{noformat}

* --minSupport is the minimum frequency a word must have to be considered as a feature.
* --minDF is the minimum number of documents the word must appear in.
* --maxDFPercent is the maximum allowed value of the expression (document frequency of a word / total number of documents), expressed as a percentage, for the word to be kept as a feature.  This helps remove high-frequency features such as stop words.
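
The three thresholds can be illustrated with a small, self-contained Java sketch.  It is not the seq2sparse implementation: DfFilterSketch and selectFeatures are hypothetical names, documents are assumed to be already tokenized, and --minSupport is assumed to mean the total number of occurrences across the whole corpus.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: keeps terms that pass minSupport, minDF and maxDFPercent. */
final class DfFilterSketch {

    static Set<String> selectFeatures(List<List<String>> docs,
                                      int minSupport, int minDF, int maxDFPercent) {
        Map<String, Integer> termFreq = new HashMap<>(); // occurrences across the corpus
        Map<String, Integer> docFreq = new HashMap<>();  // documents containing the term
        for (List<String> doc : docs) {
            for (String term : doc) {
                termFreq.merge(term, 1, Integer::sum);
            }
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            String term = entry.getKey();
            int df = entry.getValue();
            double dfPercent = 100.0 * df / docs.size();
            if (termFreq.get(term) >= minSupport && df >= minDF && dfPercent <= maxDFPercent) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("the", "dog", "sat"),
            Arrays.asList("the", "cat", "ran"),
            Arrays.asList("the", "bird", "flew"));
        // "the" appears in 100% of documents, so a 75% cap removes it;
        // words occurring only once fail minSupport = 2.
        System.out.println(selectFeatures(docs, 2, 1, 75));
    }
}
```

In this toy corpus only "cat" and "sat" survive: they occur twice each and appear in half of the documents.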

h1. Background

* http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
* http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering

h1. From a Database

+*TODO:*+

h1. Other

h2. Converting existing vectors to Mahout's format

If you already have a processing pipeline for your documents (texts, images, 
or whatever items you wish to treat), the question arises of how to convert 
its vectors into the Mahout vector format. Probably the easiest way is to 
implement your own Iterable<Vector> (called VectorIterable in the example 
below) and then reuse the existing VectorWriter classes:

{code}
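// Wrap the Hadoop SequenceFile.Writer in a VectorWriter (SequenceFileVectorWriter).
SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem, 
configuration, outfile, LongWritable.class, SparseVector.class);
VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
vectorWriter.close();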
{code}
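
What a custom iterable might look like is sketched below in stand-alone Java.  To keep it compilable without Mahout on the classpath, double[] stands in for Mahout's Vector type, and ArrayVectorIterable and consume are hypothetical names; consume only imitates a writer that stops after a maximum number of vectors.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Illustrative stand-in for VectorIterable; double[] replaces Mahout's Vector here. */
final class ArrayVectorIterable implements Iterable<double[]> {

    private final List<double[]> rows; // hypothetical output of an existing pipeline

    ArrayVectorIterable(List<double[]> rows) {
        this.rows = rows;
    }

    @Override
    public Iterator<double[]> iterator() {
        return new Iterator<double[]>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < rows.size();
            }

            @Override
            public double[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Copy so callers cannot modify the pipeline's own data.
                return rows.get(position++).clone();
            }
        };
    }

    /** Imitates a writer: reads at most maxDocs vectors and returns how many it read. */
    static long consume(Iterable<double[]> vectors, long maxDocs) {
        long count = 0;
        for (double[] vector : vectors) {
            if (count >= maxDocs) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ArrayVectorIterable vectors = new ArrayVectorIterable(Arrays.asList(
            new double[] {1.0, 0.0},
            new double[] {0.0, 2.0},
            new double[] {3.0, 3.0}));
        System.out.println(consume(vectors, Long.MAX_VALUE));
    }
}
```

In a real conversion the iterator would build Mahout Vector instances from your pipeline's records, one per call to next(), so the whole data set never has to sit in memory at once.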

