[CONF] Apache Lucene Mahout > Creating Vectors from Text

confluence Sat, 19 Dec 2009 08:05:31 -0800

Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: Creating Vectors from Text 
(http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text)


Change Comment:
---------------------------------------------------------------------
typo repair

Edited by Benson Margulies:
---------------------------------------------------------------------
+*Mahout_0.2*+

h1. Introduction

For clustering documents it is usually necessary to convert the raw text into 
vectors that can then be consumed by the clustering [Algorithms].  The 
approaches for doing so are described below.

h1. From Lucene

Mahout has utilities that allow one to easily produce Mahout Vector 
representations from a Lucene (and Solr, since they are the same) index.

For this, we assume you know how to build a Lucene/Solr index.  For those who 
don't, it is probably easiest to get up and running using 
[Solr|http://lucene.apache.org/solr] as it can ingest things like PDFs, XML, 
Office, etc. and create a Lucene index.  For those wanting to use just Lucene, 
see the Lucene [website|http://lucene.apache.org/java] or check out _Lucene In 
Action_ by Erik Hatcher, Otis Gospodnetic and Mike McCandless.

To get started, make sure you get a fresh copy of Mahout from 
[SVN|http://cwiki.apache.org/MAHOUT/buildingmahout.html] and are comfortable 
building it.  You will also need to [apply the 
patch|http://cwiki.apache.org/MAHOUT/howtocontribute.html] on MAHOUT-126.  This 
patch creates a "utils" module in Mahout, at the same level as the Core, that 
defines utilities for working with Mahout.  In this case, it defines interfaces 
and implementations for efficiently iterating over a Data Source and producing 
a Mahout Vector file and a term dictionary, which can then be used for 
clustering.  Only Lucene is supported as a Data Source currently, but the 
design should be extensible to databases, Solr, etc.  The main code for driving 
this is the Driver program located in the 
org.apache.mahout.utils.vectors.lucene package.  The Driver program offers 
several input options, which can be displayed by specifying the --help option.  
Examples of running the Driver are included below:

h2. Generating an output file from a Lucene Index

{noformat}
java -cp <CLASSPATH> org.apache.mahout.utils.vectors.lucene.Driver \
   --dir <PATH TO DIRECTORY CONTAINING LUCENE INDEX> \
   --output <PATH TO OUTPUT LOCATION> --field <NAME OF FIELD IN INDEX> \
   --dictOut <PATH TO FILE TO OUTPUT THE DICTIONARY TO> \
   <--max <Number of vectors to output>> <--norm {INF|integer >= 0}> \
   <--idField <Name of the idField in the Lucene index>>
{noformat}

h3. Create 50 Vectors from an Index 
{noformat}
org.apache.mahout.utils.vectors.lucene.Driver \
    --dir <PATH>/wikipedia/solr/data/index --field body \
    --dictOut <PATH>/solr/wikipedia/dict.txt \
    --output <PATH>/solr/wikipedia/out.txt --max 50
{noformat}
This reads the body field from the index specified by --dir, writes the 
vectors to the location given by --output, and writes the dictionary to 
dict.txt.  It outputs only 50 vectors.  If you don't specify --max, then all 
the documents in the index are output.
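
To make the relationship between the term dictionary and the vectors concrete, here is a minimal sketch in plain Java. It does not use Mahout or Lucene, and it is not the Driver's implementation: the class name TermVectorSketch, the whitespace/punctuation tokenizer, and the use of raw term counts as weights are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Mahout.
public class TermVectorSketch {
    // Term -> vector index, assigned in order of first appearance (assumed layout).
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    /** Turns one document into a sparse term-count vector (index -> count). */
    public Map<Integer, Double> vectorize(String text) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            Integer index = dictionary.get(token);
            if (index == null) {
                index = dictionary.size(); // next free index
                dictionary.put(token, index);
            }
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public Map<String, Integer> dictionary() {
        return dictionary;
    }

    public static void main(String[] args) {
        TermVectorSketch sketch = new TermVectorSketch();
        // prints {0=2.0, 1=1.0, 2=1.0, 3=1.0, 4=1.0}
        System.out.println(sketch.vectorize("the cat sat on the mat"));
        // prints {0=1.0, 2=1.0, 5=1.0}
        System.out.println(sketch.vectorize("the dog sat"));
        System.out.println(sketch.dictionary());
    }
}
```

The dictionary is what lets you map a vector index back to a term after clustering, which is why the Driver writes it out alongside the vectors.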

h3. Normalize 50 Vectors from an Index using the [L_2 Norm|http://en.wikipedia.org/wiki/Lp_space]
{noformat}
org.apache.mahout.utils.vectors.lucene.Driver \
    --dir <PATH>/wikipedia/solr/data/index --field body \
    --dictOut <PATH>/solr/wikipedia/dict.txt \
    --output <PATH>/solr/wikipedia/out.txt --max 50 --norm 2
{noformat}
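
For readers unfamiliar with L_p normalization, the following plain-Java sketch shows what dividing a vector by its p-norm means, including the INF case. It is an illustration under stated assumptions, not the Driver's code: the class NormSketch and its normalize method are hypothetical, and the sketch only handles p >= 1 (how the Driver treats a norm of 0 is not covered here).

```java
// Hypothetical illustration only; not part of Mahout.
public class NormSketch {
    /**
     * Returns a copy of v divided by its p-norm.
     * Pass Double.POSITIVE_INFINITY for the max (INF) norm.
     */
    public static double[] normalize(double[] v, double p) {
        if (p < 1.0) {
            throw new IllegalArgumentException("this sketch only handles p >= 1");
        }
        double norm = 0.0;
        if (Double.isInfinite(p)) {
            // INF norm: the largest absolute component
            for (double x : v) {
                norm = Math.max(norm, Math.abs(x));
            }
        } else {
            // p-norm: (sum of |x|^p)^(1/p)
            for (double x : v) {
                norm += Math.pow(Math.abs(x), p);
            }
            norm = Math.pow(norm, 1.0 / p);
        }
        double[] out = new double[v.length];
        if (norm == 0.0) {
            return out; // all-zero input: return zeros rather than divide by zero
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] unit = normalize(new double[] {3.0, 4.0}, 2.0);
        System.out.println(unit[0] + ", " + unit[1]); // roughly 0.6, 0.8
    }
}
```

With the L_2 norm every non-zero vector ends up with Euclidean length 1, so long and short documents become comparable by direction rather than by size.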

h2. Background

* http://www.lucidimagination.com/search/document/3d8310376b6cdf6b/centroid_calculations_with_sparse_vectors#86a54dae9052d68c
* http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering

h1. From a Database

+*TODO:*+

h1. Other

h2. Converting existing vectors to Mahout's format

If you already have a processing pipeline that produces vectors for your 
documents (texts, images or whatever items you wish to treat), the question 
arises of how to convert those vectors into the Mahout vector format.  Probably 
the easiest way is to implement your own Iterable<Vector> (called 
VectorIterable in the example below) and then reuse the existing VectorWriter 
classes:

{code}
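// createWriter returns a SequenceFile.Writer; wrapping it in SequenceFileVectorWriter is an assumption, check your Mahout version.
SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem, 
    configuration, outfile, LongWritable.class, SparseVector.class);
VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);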
{code}
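
The shape of such an iterable can be sketched as follows. To keep the sketch compilable without Mahout on the classpath, double[] stands in for Mahout's Vector type, and the class name PipelineVectorIterable and its List-backed storage are hypothetical. In a real implementation the element type would be Vector and each row would be copied into a Mahout vector instance.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; double[] stands in for Mahout's Vector.
public class PipelineVectorIterable implements Iterable<double[]> {
    // Stand-in for whatever your existing pipeline produces.
    private final List<double[]> rows;

    public PipelineVectorIterable(List<double[]> rows) {
        this.rows = rows;
    }

    @Override
    public Iterator<double[]> iterator() {
        return new Iterator<double[]>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < rows.size();
            }

            @Override
            public double[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Hand out a copy so the writer cannot modify pipeline data.
                return rows.get(position++).clone();
            }
        };
    }
}
```

Producing vectors lazily inside next(), rather than building them all up front, keeps memory use flat when the pipeline yields more documents than fit in memory.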

