[ https://issues.apache.org/jira/browse/MAHOUT-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dawid Weiss updated MAHOUT-5:
-----------------------------
Attachment: kmeans.zip
This is an implementation of k-means in non-MapReduce (non-MR) form. It isn't
intended for immediate inclusion, but if it helps, please feel free to
copy/paste anything from this code.
There are several things that should be taken into account when implementing
k-means:
- initial centroid vectors (several possibilities: random, max-difference-pick,
preclustering phase, subsampling and averaging),
- termination criterion (decrease of the global objective function, number of
iterations, combination of these),
- various optimizations. The document vectors are typically truncated (leaving
values of most significant dimensions for each document), sampled (leaving only
the most significant dimensions for all documents), or transformed (SVD or
other form of matrix decomposition, truncation of least significant dimensions
after the decomposition).
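
To make the first two points concrete, here is a minimal in-memory sketch. It
is not the code in kmeans.zip. The class name KMeansSketch, the Result holder,
and the parameters (maxIterations, minDecrease, seed) are all hypothetical and
chosen only for illustration. It assumes dense double[] vectors, squared
Euclidean distance, random initialization from the input points, and
termination on either an iteration cap or a small decrease of the global
objective.

```java
import java.util.Random;

/** Illustrative, non-MapReduce k-means sketch; all names here are hypothetical. */
final class KMeansSketch {

    /** Final centroids, per-point cluster index, and number of iterations run. */
    static final class Result {
        final double[][] centroids;
        final int[] assignments;
        final int iterations;

        Result(double[][] centroids, int[] assignments, int iterations) {
            this.centroids = centroids;
            this.assignments = assignments;
            this.iterations = iterations;
        }
    }

    static Result cluster(double[][] points, int k, int maxIterations,
                          double minDecrease, long seed) {
        int n = points.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        int dim = points[0].length;

        // Random initialization: choose k distinct input points as the starting
        // centroids using a partial Fisher-Yates shuffle of the point indices.
        Random random = new Random(seed);
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + random.nextInt(n - c);
            int tmp = order[c];
            order[c] = order[j];
            order[j] = tmp;
            centroids[c] = points[order[c]].clone();
        }

        int[] assignments = new int[n];
        double previousObjective = Double.POSITIVE_INFINITY;
        int iterations = 0;
        while (iterations < maxIterations) {
            iterations++;

            // Assignment step: attach each point to its nearest centroid and
            // accumulate the global objective (sum of squared distances).
            double objective = 0.0;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDistance = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDistance) {
                        bestDistance = d;
                        best = c;
                    }
                }
                assignments[i] = best;
                objective += bestDistance;
            }

            // Termination: stop once the objective no longer decreases by at
            // least minDecrease (the iteration cap is the loop condition).
            if (previousObjective - objective < minDecrease) {
                break;
            }
            previousObjective = objective;

            // Update step: move each centroid to the mean of its points.
            // A cluster that received no points keeps its previous centroid.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignments[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignments[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return new Result(centroids, assignments, iterations);
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```

A call such as KMeansSketch.cluster(points, 2, 100, 1e-9, 42L) returns the
centroids and the cluster index of each point. The other initialization
strategies listed above, and the vector truncation or decomposition steps,
would replace or precede the corresponding parts of this sketch.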
> Implement a k-means clustering prototype
> -----------------------------------------
>
> Key: MAHOUT-5
> URL: https://issues.apache.org/jira/browse/MAHOUT-5
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
> Priority: Minor
> Attachments: kmeans.zip
>
>
> K-means clustering is closely related to Canopy clustering and often uses
> canopies to determine the initial clusters. I'd like to implement a k-means
> prototype and tests in the package org.apache.mahout.clustering.kmeans.