[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Hu updated MAHOUT-1214:
-----------------------------

    Status: Patch Available  (was: Open)

This is the patch based on the version 0.7 (the source code is checkout from 
tag 0.7). We have made three changes
1.      We fixed a bug in SpectralKMeansDriver.java (line 146)
LanczosState state = new LanczosState(L, overshoot, solver.getInitialVector(L));
2.      Given the number K of the requested eigenvectors, we add a mechanism to 
return the most orthogonal set of K eigenvectors. By sorting the eigenvectors 
by its associated eigenvalues, it starts from the eigenvector with the largest 
eigenvalue. Then iteratively add the most orthogonal eigenvectors in to the set;
related files: EigenVerificationJob.java
3.      Instead of initialise KMeans using random seed, we use the data point 
with the largest value in every selected eigenvector as the seed;
Related files: EigenSeedGenerator.java (new file)
4.      We add the support of directly using data id to represent the affinity 
matrix;
Related files: AffinityMatrixGenericInputMapper.java(new file), 
AffinityMatrixGenericInputReducer.java(new file) , AffinityMatrixInputJob.java
 
We also provide a naive example to demonstrate the improvement. This is a 
dataset of 7 points. The affinity matrix is as follows:
 
[ 1    0.8    0      0       0     0    0
 0.8    1     0      0       0     0    0
  0     0     1     0.9     0.7    0    0
  0     0    0.9     1      0.8    0    0
  0     0    0.7    0.8      1     0    0
  0     0     0      0       0     1   0.7
  0     0     0      0       0    0.7   1 ]
 
You can see the cross-cluster similarities are all zero and intra-cluster 
similarities are very close to 1. It is very clear there are 3 clusters.
 
The original v0.7 only support to represent this affinity matrix using the 
format of (#row, #col, similarity). And the output is 
 
matrix_1 result: (0.7 version)
kmeans.SpectralKMeansDriver: 0: 1
kmeans.SpectralKMeansDriver: 1: 1
kmeans.SpectralKMeansDriver: 2: 5
kmeans.SpectralKMeansDriver: 3: 5
kmeans.SpectralKMeansDriver: 4: 0
kmeans.SpectralKMeansDriver: 5: 1
kmeans.SpectralKMeansDriver: 6: 1
 
You can see the clustering result is far away from correct.
 
Our improvement can take either the format of (#row, #col, similarity) or 
(data1-id, data2-id, similarity). For example two types of affinity record can 
be supported:
(0, 1, 0.8) //the similarity in the first row and second column is 0.8
(a, b, 0.8) // the similarity between data a and data b is 0.8
 
The output we obtain are
Format 1 
kmeans.SpectralKMeansDriver: 0: 1
kmeans.SpectralKMeansDriver: 1: 1
kmeans.SpectralKMeansDriver: 2: 0
kmeans.SpectralKMeansDriver: 3: 0
kmeans.SpectralKMeansDriver: 4: 0
kmeans.SpectralKMeansDriver: 5: 2
kmeans.SpectralKMeansDriver: 6: 2
 
Format 2
kmeans.SpectralKMeansDriver: a: 1
kmeans.SpectralKMeansDriver: b: 1
kmeans.SpectralKMeansDriver: c: 0
kmeans.SpectralKMeansDriver: d: 0
kmeans.SpectralKMeansDriver: e: 0
kmeans.SpectralKMeansDriver: f: 2
kmeans.SpectralKMeansDriver: g: 2
 
Note that in the second case user don't need to map the data to integer index. 
We will do it automatically. And the clustering result now is 100% correct.
                
> Improve the accuracy of the Spectral KMeans Method
> --------------------------------------------------
>
>                 Key: MAHOUT-1214
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: Mahout 0.7
>            Reporter: Yiqun Hu
>              Labels: clustering, improvement
>             Fix For: Backlog
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to