Quick tour of text analysis using the Mahout command line
Page edited by Suneel Marthi
Introduction

This is a quick tour of using the mahout command line to generate text analysis data. It follows examples from the Mahout in Action book and uses the Reuters-21578 data set. It shows one simple path through vectorizing text, creating clusters, and calculating similar documents. The examples work locally or distributed on a Hadoop cluster; with the small data set provided, a local installation is probably fast enough. This walkthrough was originally written for the Mahout 0.6 CLI and has been updated for versions >= 0.7. When in doubt, executing any command without parameters will output help.

Generate Mahout vectors from text

Get the Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/) files and extract them into “./reuters”. They are in SGML format. Mahout can also create sequence files from raw text and other formats. At the end of this section you will have the text files turned into vectors, which are basically lists of weighted tokens. The weights are calculated to indicate the importance of each token.
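To make "lists of weighted tokens" concrete, here is a minimal Python sketch of one common weighting scheme, textbook TF-IDF. It is an illustration only: the function name tfidf_vectors, the whitespace tokenizer, and the three sample sentences are assumptions made for this example, and Mahout's own tokenization and weighting formula may differ in detail.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: weight} dict per document (hypothetical helper)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token occur?
    df = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        # Rare tokens get a large weight; a token present in every
        # document gets log(1) = 0.
        vectors.append(
            {token: count * math.log(n_docs / df[token])
             for token, count in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    sample = ["oil prices rise", "oil exports fall", "wheat prices fall"]
    for vector in tfidf_vectors(sample):
        print(vector)
```

In the first sample sentence, "rise" occurs in only one document and so receives a higher weight than "oil", which occurs in two.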
Cluster documents using kmeans

Clustering documents can be done with one of several clustering algorithms in Mahout. Perhaps the best known is kmeans, which drops documents into k categories. You have to supply k as input along with the vectors. The output is k centroids (vectors), one per cluster, and optionally the documents assigned to each cluster.
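The inputs and outputs described above can be sketched with a small in-memory version of the standard k-means loop (Lloyd's algorithm). This is a sketch under stated assumptions, not Mahout's implementation: the function names, the dense list-of-floats vectors, the squared Euclidean distance, and the fixed iteration count are all choices made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, iterations=10, seed=0):
    """Return (centroids, assignments) for the given vectors and k."""
    rng = random.Random(seed)
    # Start from k randomly chosen input vectors.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return centroids, assignments


if __name__ == "__main__":
    data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
            [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
    centers, labels = kmeans(data, k=2)
    print(centers)
    print(labels)
```

The two return values correspond to the k centroids and the optional per-document cluster assignments.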
Calculate several similar docs to each doc in the data

This step takes every doc in the data set and calculates its 10 most similar docs. The result can be used for a "give me more like this" feature. The algorithm is fairly fast and requires only three MapReduce passes.
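As an illustration of what "the 10 most similar docs" means, the following Python sketch scores every pair of documents and keeps the best matches per document. It is a brute-force, in-memory version, not the three-pass MapReduce algorithm; the cosine measure, the function names, and the {token: weight} vector format are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse {token: weight} vectors."""
    dot = sum(weight * b[token] for token, weight in a.items() if token in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(vectors, top_n=10):
    """Map each doc index to its top_n (other_index, score) pairs."""
    result = {}
    for i, vi in enumerate(vectors):
        scored = [(cosine(vi, vj), j)
                  for j, vj in enumerate(vectors) if j != i]
        # Highest score first; ties are broken by the lower doc index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        result[i] = [(j, score) for score, j in scored[:top_n]]
    return result


if __name__ == "__main__":
    docs = [
        {"oil": 1.0, "prices": 1.0},
        {"oil": 1.0, "exports": 1.0},
        {"wheat": 1.0},
    ]
    for doc_id, neighbours in most_similar(docs, top_n=2).items():
        print(doc_id, neighbours)
```

Comparing all pairs costs time quadratic in the number of docs, which is fine for a toy example but is the reason a distributed implementation is used on real data sets.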
Conclusion

A wide variety of tasks can be performed from the Mahout command line. Many parameters available in the Java API are supported, so it is a good way to get an idea of how Mahout works and gives you a basis for tuning your own use.
