[
https://issues.apache.org/jira/browse/MAHOUT-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118474#comment-13118474
]
Eugene Kirpichov commented on MAHOUT-823:
-----------------------------------------
Sean, you're right - I was somehow blindly assuming that the sequential vector
is O( n ). Indeed benchmarks need to be done. If there'd be a way to make the
code simpler, sometimes much faster than the current version, and always not
much slower (probably getting "always not slower" would be harder, as the
constant factors measured in benchmarks would differ across particular
deployments) - that would be great.
> RandomAccessSparseVector.dot with another non-sequential vector can be
> extremely non-symmetric in its performance
> -----------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-823
> URL: https://issues.apache.org/jira/browse/MAHOUT-823
> Project: Mahout
> Issue Type: Improvement
> Components: Math
> Affects Versions: 0.5
> Reporter: Eugene Kirpichov
> Assignee: Sean Owen
> Labels: dot, dot-product, vector
> Fix For: 0.6
>
> Attachments: MAHOUT-823.patch
>
>
> http://codesearch.google.com/#6LK_nEANBKE/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java&l=172
> The complexity of the algorithm is O(num nondefault elements in this), while
> it could clearly be O(min(num nondefault in this, num nondefault in x)).
> This can be fixed by adding this code before line 189.
> {code}
> if(x.getNumNondefaultElements() < this.getNumNondefaultElements()) {
> return x.dot(this);
> }
> {code}
> An easy case where this asymmetry is very apparent and makes a huge
> difference in performance is K-Means clustering.
> In K-Means for high-dimensional points (e.g. those that arise in text
> retrieval problems), the centroids often have a huge number of non-zero
> components, whereas points have a small number of them.
> So, if you make a mistake and use centroid.dot(point) in your code for
> computing the distance, instead of point.dot(centroid), you end up with
> orders of magnitude worse performance (which is what we actually observed -
> the clustering time was a couple of minutes with this fix and over an hour
> without it).
> So, perhaps, if you make this fix, quite a few people who had a similar case
> but didn't notice it will suddenly have a dramatic performance increase :)
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira