[ 
https://issues.apache.org/jira/browse/MAHOUT-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686325#comment-13686325
 ] 

Yiqun Hu commented on MAHOUT-1214:
----------------------------------

@Robin, That's what I try to do: create a new test case to reproduce this bug 
and then raise a ticket. But I can explain a little bit here. Different from 
v0.7, v0.8 start to use aggregate to finish the vector operation. Example is 
like "x.aggregate(y, Functions.PLUS, Functions.MULT)" for computing the dot 
product. The implementation of the aggregate function in AbstractVector.java 
uses VectorBinaryAggregate.aggregateBest. What it tries to do is find the 
optimal operation and do the aggregation. One of the operation is 
AggregateAllLoop (in VectorBinaryAggregate.java).

The bug is in this place. The implmentation of aggregate function is as follows
@Override
public double aggregate(Vector x, Vector y, DoubleDoubleFunction fa, 
DoubleDoubleFunction fc) {
  double result = fc.apply(x.getQuick(0), y.getQuick(0));
  for (int i = 1; i < x.size(); ++i) {
    result = fa.apply(result, fc.apply(x.getQuick(i), y.getQuick(i)));
  }
  return result;
}

Notice that it loop over the index from 0 to size()-1 for the vector. In the 
case of the vector is a RandomAccessSparseVector (e.g. {3: 0.11, 4: 0.38, 
5:0.2}, size()=3), this aggregate function will try to access the element whose 
indices are {0, 1, 2} and all values are 0.0. Because nonzero elements are not 
at the beginning. For our spectral kmeans example, this bug occurs when we 
calculate the dot product between two exactly same cluster centroids (in eigen 
space) and output dot prodct as 0.0, which is actually should be the norm of 
the centroids (1.0). This bug will occurs whenever computing the aggregate 
function of two sparse vector.

Is it clear now?  
                
> Improve the accuracy of the Spectral KMeans Method
> --------------------------------------------------
>
>                 Key: MAHOUT-1214
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1214
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.7
>         Environment: Mahout 0.7
>            Reporter: Yiqun Hu
>            Assignee: Robin Anil
>              Labels: clustering, improvement
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1214.patch, matrix_1, matrix_2
>
>
> The current implementation of the spectral KMeans algorithm (Andrew Ng. etc. 
> NIPS 2002) in version 0.7 has two serious issues. These two incorrect 
> implementations make it fail even for a very obvious trivial dataset. We have 
> implemented a solution to resolve these two issues and hope to contribute 
> back to the community.
> # Issue 1: 
> The EigenVerificationJob in version 0.7 does not check the orthogonality of 
> eigenvectors, which is necessary to obtain the correct clustering results for 
> the case of K>1; We have an idea and implementation to select based on 
> cosAngle/orthogonality;
> # Issue 2:
> The random seed initialization of KMeans algorithm is not optimal and 
> sometimes a bad initialization will generate wrong clustering result. In this 
> case, the selected K eigenvector actually provides a better way to initalize 
> cluster centroids because each selected eigenvector is a relaxed indicator of 
> the memberships of one cluster. For every selected eigenvector, we use the 
> data point whose eigen component achieves the maximum absolute value. 
> We have already verified our improvement on synthetic dataset and it shows 
> that the improved version get the optimal clustering result while the current 
> 0.7 version obtains the wrong result.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to