I'm not disputing that it is a performance improvement for sparse
vectors, just saying that changing the class of the vector should not be
necessary: if the vectors being clustered are dense, the cluster
constructors should leave them dense. If the vectors being clustered are
of a sparse variety, the constructors should use the same flavor for the
clusters. I know I missed the previous discussion, but this change
violates the API contract of the constructors, and it makes debugging
new test cases that use dense vectors a PITA.
I'm still opposed to it.
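
A minimal sketch of the contract described above, assuming a vector type
that can produce an empty vector of its own concrete class. The names
Vec, DenseVec, SparseVec, SketchCluster and like() are hypothetical
stand-ins used only for illustration; they are not the actual Mahout
classes or the code in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a vector type; not Mahout's real interface.
interface Vec {
    int size();
    double get(int index);
    void set(int index, double value);
    // Returns an empty vector of the same concrete class and cardinality.
    Vec like();
}

// Dense flavor: every element is stored in an array.
final class DenseVec implements Vec {
    private final double[] values;
    DenseVec(int size) { this.values = new double[size]; }
    public int size() { return values.length; }
    public double get(int index) { return values[index]; }
    public void set(int index, double value) { values[index] = value; }
    public Vec like() { return new DenseVec(values.length); }
}

// Sparse flavor: only non-zero elements are stored.
final class SparseVec implements Vec {
    private final int size;
    private final Map<Integer, Double> values = new HashMap<>();
    SparseVec(int size) { this.size = size; }
    public int size() { return size; }
    public double get(int index) { return values.getOrDefault(index, 0.0); }
    public void set(int index, double value) {
        if (value == 0.0) {
            values.remove(index);
        } else {
            values.put(index, value);
        }
    }
    public Vec like() { return new SparseVec(size); }
}

// Hypothetical cluster whose constructor keeps the flavor of its input.
final class SketchCluster {
    private final Vec center;

    SketchCluster(Vec point) {
        // Ask the point for an empty vector of its own class instead of
        // hard-coding one implementation for every center.
        Vec copy = point.like();
        // Simple element-wise copy; a real sparse implementation would
        // iterate over the non-zero entries only.
        for (int i = 0; i < point.size(); i++) {
            double value = point.get(i);
            if (value != 0.0) {
                copy.set(i, value);
            }
        }
        this.center = copy;
    }

    Vec getCenter() { return center; }
}
```

With this shape, a test that builds a cluster from a DenseVec gets a
DenseVec center back, and one that builds it from a SparseVec gets a
SparseVec center, so the caller's choice of representation survives the
constructor.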
On 4/27/10 12:13 AM, Robin Anil (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861272#action_12861272 ]
Robin Anil commented on MAHOUT-297:
-----------------------------------
There was a discussion about this on the dev list. Check the util Vector
Benchmarks and see how much faster clustering became after this change.
It shouldn't necessarily be SeqAcc if the points are all dense vectors,
but the obvious savings for sparse data far outweigh the slight loss in
performance for dense data (you will see that in the vector benchmarks
code).
Canopy and Kmeans clustering slows down on using SeqAccVector for center
------------------------------------------------------------------------
Key: MAHOUT-297
URL: https://issues.apache.org/jira/browse/MAHOUT-297
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
Fix For: 0.4
Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch,
MAHOUT-297.patch, MAHOUT-297.patch