ClusteringYourData

+*Mahout_0.2*+

After you've done the [QuickStart] and are familiar with the basics of Mahout, 
it is time to cluster your own data. 

The following pieces *may* be useful in getting started:

h1. Input

For starters, you will need your data in an appropriate Vector format (which
has changed since Mahout 0.1).
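
As a minimal sketch of building such vectors programmatically (assuming the 0.2-era API, where the Vector implementations live in {{org.apache.mahout.matrix}}; the clustering jobs then expect these vectors serialized into a Hadoop SequenceFile):

{code:java}
import org.apache.mahout.matrix.DenseVector;
import org.apache.mahout.matrix.Vector;

public class VectorExample {
  public static void main(String[] args) {
    // One data point = one Vector; here a point in 3-d space.
    Vector point = new DenseVector(new double[] {1.0, 2.0, 3.0});
    point.set(2, 4.5); // elements can also be set by index
    System.out.println(point.asFormatString());
  }
}
{code}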

h2. Text Preparation

* See [Creating Vectors from Text]
* [Document clustering thread on the mahout-user list|http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering]

h1. Running the Process

+*TODO*+ FILL ME IN

h2. Canopy

Background: [canopy | Canopy Clustering]
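
A sketch of kicking off canopy clustering from Java. The {{runJob}} call shown is a 0.2-era assumption; the argument order, the paths, and the distance-measure package are all worth checking against the driver in your release:

{code:java}
import org.apache.mahout.clustering.canopy.CanopyDriver;

public class CanopyExample {
  public static void main(String[] args) throws Exception {
    // input: SequenceFile of Vectors on HDFS; output: working directory.
    // t1 > t2 are the loose and tight canopy distance thresholds.
    CanopyDriver.runJob("testdata/vectors", "output/canopies",
        "org.apache.mahout.utils.EuclideanDistanceMeasure", 3.0, 1.5);
  }
}
{code}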

h2. kMeans

Background: [k-means]
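
Similarly for k-means; the driver is typically seeded with the clusters produced by the canopy step above. The signature below is again a 0.2-era assumption (the convergence delta, iteration cap, and reducer count in particular may differ in your release):

{code:java}
import org.apache.mahout.clustering.kmeans.KMeansDriver;

public class KMeansExample {
  public static void main(String[] args) throws Exception {
    // input vectors, initial cluster centers (e.g. canopy output),
    // output dir, distance measure, convergence delta,
    // max iterations, number of reducers.
    KMeansDriver.runJob("testdata/vectors", "output/canopies",
        "output/kmeans", "org.apache.mahout.utils.EuclideanDistanceMeasure",
        0.001, 10, 1);
  }
}
{code}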

h2. Dirichlet

Background: [dirichlet | Dirichlet Process Clustering]
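
Dirichlet process clustering is driven the same way, but it takes a model distribution rather than a distance measure. Everything below (the model factory class, the alpha_0 prior, the argument order) is an assumption based on the 0.2-era {{DirichletDriver}}; verify against the source:

{code:java}
import org.apache.mahout.clustering.dirichlet.DirichletDriver;

public class DirichletExample {
  public static void main(String[] args) throws Exception {
    // input vectors, output dir, model distribution factory,
    // number of models (clusters), max iterations, alpha_0, num reducers.
    DirichletDriver.runJob("testdata/vectors", "output/dirichlet",
        "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution",
        10, 5, 1.0, 1);
  }
}
{code}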

h2. Mean-shift

Background: [meanshift | Mean Shift]
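
Mean shift combines canopy-style thresholds with iterative convergence, so its job takes both kinds of parameters. Again a 0.2-era sketch with an assumed class name and signature:

{code:java}
import org.apache.mahout.clustering.meanshift.MeanShiftCanopyJob;

public class MeanShiftExample {
  public static void main(String[] args) throws Exception {
    // input vectors, output dir, distance measure,
    // t1/t2 thresholds, convergence delta, max iterations.
    MeanShiftCanopyJob.runJob("testdata/vectors", "output/meanshift",
        "org.apache.mahout.utils.EuclideanDistanceMeasure",
        3.0, 1.5, 0.5, 10);
  }
}
{code}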

h1. Retrieving the Output

+*TODO*+
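
Until this section is filled in, one workable approach is to read the job's SequenceFile output directly with the stock Hadoop API. The path below is an assumption about your output layout; the values print via their {{toString()}}, which for Mahout's clusters and vectors is a readable format string:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path; adjust to whatever output dir your job used.
    Path path = new Path("output/kmeans/clusters-9/part-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key =
        (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value =
        (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}
{code}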

h1. Validating the Output

From Ted Dunning's response on the mahout-user list:
http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
{quote}
A principled approach to cluster evaluation is to measure how well the
cluster membership captures the structure of unseen data.  A natural measure
for this is to measure how much of the entropy of the data is captured by
cluster membership.  For k-means and its natural L_2 metric, the natural
cluster quality metric is the squared distance from the nearest centroid
adjusted by the log_2 of the number of clusters.  This can be compared to
the squared magnitude of the original data or the squared deviation from the
centroid for all of the data.  The idea is that you are changing the
representation of the data by allocating some of the bits in your original
representation to represent which cluster each point is in.  If those bits
aren't made up by the residue being small then your clustering is making a
bad trade-off.

In the past, I have used other more heuristic measures as well.  One of the
key characteristics that I would like to see out of a clustering is a degree
of stability.  Thus, I look at the fractions of points that are assigned to
each cluster or the distribution of distances from the cluster centroid.
These values should be relatively stable when applied to held-out data.

For text, you can actually compute perplexity which measures how well
cluster membership predicts what words are used.  This is nice because you
don't have to worry about the entropy of real valued numbers.

Manual inspection and the so-called laugh test is also important.  The idea
is that the results should not be so ludicrous as to make you laugh.
Unfortunately, it is pretty easy to kid yourself into thinking your system
is working using this kind of inspection.  The problem is that we are too
good at seeing (making up) patterns.
{quote}
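
One concrete reading of the squared-distance-plus-bits trade-off described above (this is an illustration of the idea, not a Mahout API; plain arrays stand in for Vectors): compute the mean squared distance from each held-out point to its nearest centroid, and remember that a clustering with k centroids spends log_2(k) extra bits per point on membership, so a larger k should win only if its residue shrinks enough to pay for those bits.

{code:java}
public class ClusterQuality {
  /** Squared Euclidean (L_2) distance between two points. */
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Mean squared distance from each point to its nearest centroid. */
  static double meanResidue(double[][] points, double[][] centroids) {
    double total = 0.0;
    for (double[] p : points) {
      double best = Double.POSITIVE_INFINITY;
      for (double[] c : centroids) {
        best = Math.min(best, squaredDistance(p, c));
      }
      total += best;
    }
    return total / points.length;
  }

  /** Bits per point spent encoding membership in one of k clusters. */
  static double membershipBits(int k) {
    return Math.log(k) / Math.log(2.0);
  }
}
{code}

Comparing {{meanResidue}} on held-out data across runs with different k, alongside {{membershipBits(k)}}, gives a rough sense of whether extra clusters are earning their keep.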


h1. References

* [Mahout archive references|http://www.lucidimagination.com/search/p:mahout?q=clustering]
