[ 
https://issues.apache.org/jira/browse/MAHOUT-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated MAHOUT-5:
-----------------------------

    Attachment: kmeans.zip

This is a non-MapReduce implementation of k-means. It isn't intended for 
immediate inclusion, but if it helps, please feel free to copy and paste 
anything from this code.

There are several things that should be taken into account when implementing 
k-means:

- choice of initial centroid vectors (several possibilities: random selection, 
max-difference pick, a preclustering phase, subsampling and averaging),

- termination criterion (decrease in the global objective function, number of 
iterations, or a combination of these),

- various optimizations. The document vectors are typically truncated (keeping 
only the values of the most significant dimensions for each document), sampled 
(keeping only the most significant dimensions across all documents), or 
transformed (SVD or another form of matrix decomposition, followed by 
truncation of the least significant dimensions).
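
To make the first two points concrete, here is a minimal in-memory sketch in 
Java. It is not the code from kmeans.zip and not a Mahout API: the class name 
KMeansSketch, its Result holder, and the parameters maxIterations, minDecrease 
and seed are all hypothetical. It assumes dense double[] vectors, squared 
Euclidean distance as the objective, random selection of k distinct input 
points as initial centroids, and a combined termination criterion (objective 
decrease below a threshold, or an iteration cap).

```java
import java.util.Random;

/** Hypothetical sketch; not taken from kmeans.zip or from Mahout. */
final class KMeansSketch {

    /** Final centroids, per-point cluster index, and iterations actually run. */
    static final class Result {
        final double[][] centroids;
        final int[] assignments;
        final int iterations;

        Result(double[][] centroids, int[] assignments, int iterations) {
            this.centroids = centroids;
            this.assignments = assignments;
            this.iterations = iterations;
        }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Assumes points is non-empty, rectangular, and 1 <= k <= points.length. */
    static Result cluster(double[][] points, int k, int maxIterations,
                          double minDecrease, long seed) {
        int n = points.length;
        int dim = points[0].length;

        // Random initialization: a partial Fisher-Yates shuffle picks k
        // distinct input points as the starting centroids.
        Random random = new Random(seed);
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        for (int i = 0; i < k; i++) {
            int j = i + random.nextInt(n - i);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[order[c]].clone();

        int[] assignments = new int[n];
        double previousObjective = Double.POSITIVE_INFINITY;
        int iteration = 0;
        while (iteration < maxIterations) {
            iteration++;

            // Assignment step: nearest centroid for every point. The global
            // objective is the sum of squared distances to those centroids.
            double objective = 0.0;
            for (int p = 0; p < n; p++) {
                double best = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[p], centroids[c]);
                    if (d < best) {
                        best = d;
                        assignments[p] = c;
                    }
                }
                objective += best;
            }

            // Termination: stop once the objective no longer decreases by
            // more than minDecrease; the loop condition caps the iterations.
            if (previousObjective - objective <= minDecrease) break;
            previousObjective = objective;

            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < n; p++) {
                counts[assignments[p]]++;
                for (int d = 0; d < dim; d++) sums[assignments[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster: keep old centroid
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return new Result(centroids, assignments, iteration);
    }
}
```

A call such as KMeansSketch.cluster(points, 2, 50, 1e-9, 42L) would run at most 
50 iterations and stop earlier once the objective stops improving. The other 
initialization strategies listed above would replace only the shuffle at the 
top of cluster(); the truncation and decomposition optimizations would be 
applied to the input vectors before clustering.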

> Implement a k-means clustering prototype 
> -----------------------------------------
>
>                 Key: MAHOUT-5
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-5
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>            Priority: Minor
>         Attachments: kmeans.zip
>
>
> K-means clustering is closely related to Canopy clustering and often uses 
> canopies to determine the initial clusters. I'd like to implement a k-means 
> prototype and tests in the package org.apache.mahout.clustering.kmeans. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
