[
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13675861#comment-13675861
]
Yiqun Hu commented on MAHOUT-1214:
----------------------------------
@Shannon, yep, that's what I am aiming for. The patch won't come in today.
Initially, I made my changes against the released 0.7 version, but to generate
the patch I need to base them on the svn trunk. I found a lot of changes in
trunk, so I will integrate my changes and get the patch ready for you guys to
review this week.
One more question: I also have a small enhancement from the end user's
perspective. The current input to Spectral KMeans is the affinity matrix, where
each record is in the format (#row, #col, similarity). However, for our
application, and I believe for many applications, the input is often the output
of a previous mapreduce job in which the pairwise similarity is computed and
stored in HDFS. Very often, those similarity results are in the form
(#data-id1, #data-id2, similarity). A separate job to map the ids to row and
col indices is inconvenient for the end user and sometimes difficult. So I
created a simple but effective way for the Spectral KMeans job to accept the
affinity matrix in the format of data ids. The mapping is done internally, and
the clustering output is also automatically mapped back to data ids.
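To make the id mapping concrete, here is a minimal single-machine sketch in
Python. It is not the code from the patch: the function names
(index_affinities, label_by_id) are hypothetical, and it assumes the
(#data-id1, #data-id2, similarity) records fit in memory instead of being
processed by a mapreduce job.

```python
def index_affinities(triples):
    """Turn (id1, id2, similarity) records into (row, col, similarity) records.

    Ids are assigned consecutive indices in order of first appearance.
    Returns the indexed records and a list mapping each index back to its id.
    """
    id_to_index = {}
    indexed = []
    for id1, id2, sim in triples:
        # setdefault only assigns a new index the first time an id is seen.
        row = id_to_index.setdefault(id1, len(id_to_index))
        col = id_to_index.setdefault(id2, len(id_to_index))
        indexed.append((row, col, sim))
    index_to_id = [None] * len(id_to_index)
    for data_id, idx in id_to_index.items():
        index_to_id[idx] = data_id
    return indexed, index_to_id


def label_by_id(assignments, index_to_id):
    """Map per-row cluster assignments back to the original data ids."""
    return {index_to_id[i]: cluster for i, cluster in enumerate(assignments)}


records = [("doc-a", "doc-b", 0.9), ("doc-b", "doc-c", 0.1), ("doc-a", "doc-c", 0.2)]
indexed, index_to_id = index_affinities(records)
clusters = label_by_id([0, 0, 1], index_to_id)  # assignments are made up here
```

Keeping the reverse mapping as a list indexed by row keeps the final relabeling
a simple lookup per clustered row.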
But it is not perfect: the affinity matrix input cannot use a sparse
representation (e.g. (#data1, #data2, 0) needs to be written explicitly). If you
have a better idea for this, just let me know. I will include this change in my
patch as well, to contribute it back to the community.
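To illustrate the limitation, the sketch below shows the kind of
pre-processing a user would currently need: every ordered pair of ids that is
missing from the input is written out explicitly with similarity 0. The name
densify is hypothetical, the sketch is in-memory Python rather than a
mapreduce job, and it does not symmetrize the input, which is an assumption
about what the job expects.

```python
from itertools import product


def densify(triples):
    """Return one record per ordered id pair, using 0.0 for unlisted pairs."""
    ids = []
    seen = set()
    sims = {}
    for id1, id2, sim in triples:
        for data_id in (id1, id2):
            if data_id not in seen:
                seen.add(data_id)
                ids.append(data_id)
        sims[(id1, id2)] = sim
    # Emit every ordered pair, in order of first appearance of the ids.
    return [(a, b, sims.get((a, b), 0.0)) for a, b in product(ids, repeat=2)]


dense = densify([("a", "b", 0.9)])
```

The output grows quadratically with the number of ids, which is why a sparse
input format would be preferable.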
> Improve the accuracy of the Spectral KMeans Method
> --------------------------------------------------
>
> Key: MAHOUT-1214
> URL: https://issues.apache.org/jira/browse/MAHOUT-1214
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.7
> Environment: Mahout 0.7
> Reporter: Yiqun Hu
> Labels: clustering, improvement
> Fix For: Backlog
>
>
> The current implementation of the spectral KMeans algorithm (Ng et al.,
> NIPS 2002) in version 0.7 has two serious issues. These two implementation
> errors make it fail even on a very simple dataset. We have implemented a
> solution to resolve these two issues and hope to contribute it back to the
> community.
> # Issue 1:
> The EigenVerificationJob in version 0.7 does not check the orthogonality of
> the eigenvectors, which is necessary to obtain correct clustering results
> when K>1. We have an idea and an implementation that selects eigenvectors
> based on cosAngle/orthogonality.
> # Issue 2:
> The random seed initialization of the KMeans algorithm is not optimal, and a
> bad initialization will sometimes generate a wrong clustering result. In this
> case, the selected K eigenvectors actually provide a better way to initialize
> the cluster centroids, because each selected eigenvector is a relaxed
> indicator of the memberships of one cluster. For every selected eigenvector,
> we use the data point whose eigenvector component achieves the maximum
> absolute value as the initial centroid.
> We have already verified our improvement on a synthetic dataset, and it shows
> that the improved version gets the optimal clustering result while the
> current 0.7 version obtains the wrong result.
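
The two fixes described in the issue can be sketched roughly as follows. This
is an illustration under stated assumptions, not the code from the patch: the
function names and the tolerance are hypothetical, the greedy selection rule is
only one possible reading of "select based on cosAngle/orthogonality", the
candidate eigenvectors are assumed to be non-zero, and the row normalization
step of the original algorithm is left out.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def select_orthogonal(candidates, k, tol=1e-6):
    """Greedily keep up to k candidates that are orthogonal to those already kept.

    A candidate is rejected when |cos| against any kept vector exceeds tol
    (a hypothetical threshold).
    """
    selected = []
    for vec in candidates:
        if all(abs(cosine(vec, kept)) <= tol for kept in selected):
            selected.append(vec)
        if len(selected) == k:
            break
    return selected


def initial_centroids(eigenvectors):
    """Pick one seed point per eigenvector: the point with the largest |component|.

    Each data point i is represented by the i-th components of all selected
    eigenvectors, so each returned centroid has one coordinate per eigenvector.
    """
    seeds = [max(range(len(vec)), key=lambda i: abs(vec[i])) for vec in eigenvectors]
    return [[vec[i] for vec in eigenvectors] for i in seeds]


v1 = [1.0, 1.0, 0.0, 0.0]
duplicate = [2.0, 2.0, 0.0, 0.0]  # same direction as v1, should be skipped
v2 = [0.0, 0.0, 1.0, 1.0]
chosen = select_orthogonal([v1, duplicate, v2], k=2)
centroids = initial_centroids([[0.9, 0.8, 0.1, 0.0], [0.0, 0.1, -0.7, 0.6]])
```

The seeds replace the random draw, so the KMeans step starts from one point
that each eigenvector most strongly assigns to its cluster.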