[ https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robin Anil updated MAHOUT-1214:
-------------------------------
Attachment: MAHOUT-1214.patch
Test case which shows the bug in the new AggregateBinaryFunction
> Improve the accuracy of the Spectral KMeans Method
> --------------------------------------------------
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.7
> Environment: Mahout 0.7
> Reporter: Yiqun Hu
> Assignee: Robin Anil
> Labels: clustering, improvement
> Fix For: 0.8
>
> Attachments: MAHOUT-1214.patch, MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng et
> al., NIPS 2002) in version 0.7 has two serious issues that make it fail even
> on a trivial dataset. We have implemented a solution that resolves both
> issues and hope to contribute it back to the community.
> # Issue 1:
> The EigenVerificationJob in version 0.7 does not check the orthogonality of
> the eigenvectors, which is necessary to obtain correct clustering results
> when K>1. We have an idea, and an implementation, for selecting eigenvectors
> based on cosAngle/orthogonality.
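To make the orthogonality check concrete, here is a minimal sketch of one way to select eigenvectors by cosine angle. It is not the code from the attached patch: the function names, the greedy selection order, and the `tol` threshold are all assumptions made for illustration.

```python
import math


def cos_angle(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal(candidates, k, tol=1e-6):
    """Greedily keep up to k candidates that are mutually orthogonal.

    candidates: list of vectors, assumed to be sorted by preference already
    (for example by eigenvalue). tol is a hypothetical bound on |cos angle|.
    """
    selected = []
    for vec in candidates:
        # Skip all-zero vectors: the angle is undefined for them.
        if not any(vec):
            continue
        # Keep the vector only if it is orthogonal to everything kept so far.
        if all(abs(cos_angle(vec, kept)) <= tol for kept in selected):
            selected.append(vec)
        if len(selected) == k:
            break
    return selected


candidates = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
chosen = select_orthogonal(candidates, k=2)
```

In this example the second candidate is parallel to the first and is rejected, so `chosen` holds the first and third vectors.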
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and
> a bad initialization sometimes produces a wrong clustering result. In this
> case, the K selected eigenvectors actually provide a better way to initialize
> the cluster centroids, because each selected eigenvector is a relaxed
> indicator of the memberships of one cluster. For every selected eigenvector,
> we use the data point whose eigen component achieves the maximum absolute
> value as the initial centroid of the corresponding cluster.
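The initialization rule can be sketched as follows. This is an illustration, not the patch itself: it assumes each data point is a row of the n-by-K matrix whose columns are the selected eigenvectors, and the name `initial_centroids` is hypothetical.

```python
def initial_centroids(eigenvectors):
    """Pick one initial centroid per selected eigenvector.

    eigenvectors: list of K vectors, each of length n (one entry per data
    point). Data point i is assumed to be the row
    [eigenvectors[0][i], ..., eigenvectors[K-1][i]].
    """
    centroids = []
    for vec in eigenvectors:
        # Index of the data point with the largest absolute component.
        idx = max(range(len(vec)), key=lambda i: abs(vec[i]))
        # Use that data point (its row across all eigenvectors) as a centroid.
        centroids.append([other[idx] for other in eigenvectors])
    return centroids


eigenvectors = [
    [0.9, 0.8, 0.1, 0.0],
    [0.0, 0.1, -0.7, 0.6],
]
seeds = initial_centroids(eigenvectors)
```

Here the first eigenvector peaks at point 0 and the second at point 2, so `seeds` contains those two rows. The sketch does not handle the case where two eigenvectors peak at the same data point; a real implementation would need a tie-breaking rule.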
> We have already verified our improvement on a synthetic dataset: the
> improved version gets the optimal clustering result, while the current
> 0.7 version obtains the wrong result.