I'm not arguing that it is a performance improvement for sparse vectors, just that changing the class of the vector should not be necessary: if the vectors being clustered are dense, the cluster constructors should leave them dense; if they are sparse, the constructors should use the same sparse flavor for the clusters. I know I missed the previous discussion, but this change violates the API contract of the constructors, and it makes debugging new test cases that use dense vectors a PITA.
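
To make the contract I have in mind concrete, here is a minimal sketch. Every name in it (Vec, DenseVec, SparseVec, ClusterSketch) is a hypothetical stand-in written for this mail, not Mahout's actual classes; it only assumes the vector type can hand back an empty vector of its own flavor through a like()-style method.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal vector abstraction (not the Mahout Vector interface).
interface Vec {
    int size();
    double get(int index);
    void set(int index, double value);
    Vec like(); // returns an empty vector of the same flavor and size
}

// Dense flavor: backed by a plain array.
final class DenseVec implements Vec {
    private final double[] values;
    DenseVec(int size) { this.values = new double[size]; }
    public int size() { return values.length; }
    public double get(int index) { return values[index]; }
    public void set(int index, double value) { values[index] = value; }
    public Vec like() { return new DenseVec(values.length); }
}

// Sparse flavor: stores only the non-zero entries.
final class SparseVec implements Vec {
    private final int size;
    private final Map<Integer, Double> entries = new HashMap<>();
    SparseVec(int size) { this.size = size; }
    public int size() { return size; }
    public double get(int index) { return entries.getOrDefault(index, 0.0); }
    public void set(int index, double value) {
        if (value == 0.0) {
            entries.remove(index);
        } else {
            entries.put(index, value);
        }
    }
    public Vec like() { return new SparseVec(size); }
}

// Hypothetical cluster whose constructor keeps the flavor of the point it is given.
final class ClusterSketch {
    private final Vec center;

    ClusterSketch(Vec point) {
        // Ask the point for an empty vector of its own type instead of
        // forcing one particular implementation on every caller.
        Vec copy = point.like();
        for (int i = 0; i < point.size(); i++) {
            double v = point.get(i);
            if (v != 0.0) {
                copy.set(i, v);
            }
        }
        this.center = copy;
    }

    Vec getCenter() { return center; }
}
```

With something like this, a test that builds a cluster from a dense point gets a dense center back, and a sparse point yields a sparse center, so the caller decides the representation.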

I'm still opposed to it.

On 4/27/10 12:13 AM, Robin Anil (JIRA) wrote:
     [ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861272#action_12861272 ]

Robin Anil commented on MAHOUT-297:
-----------------------------------

There was a discussion about this on the dev list. Check the util Vector 
Benchmarks and see how much faster clustering became after this change.  
It shouldn't necessarily be SeqAcc if the points are all dense vectors. But 
the obvious savings for sparse data are much better than the slight loss in 
performance for dense. (You will see that in the vector benchmarks code.)



Canopy and Kmeans clustering slows down on using SeqAccVector for center
------------------------------------------------------------------------

                 Key: MAHOUT-297
                 URL: https://issues.apache.org/jira/browse/MAHOUT-297
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.4
            Reporter: Robin Anil
            Assignee: Robin Anil
             Fix For: 0.4

         Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, 
MAHOUT-297.patch, MAHOUT-297.patch



