[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779650#action_12779650 ]
Ted Dunning commented on MAHOUT-165:
------------------------------------

bq. Then my comment is just that the hash code can be defined such that it doesn't need elements in order to be computed, so shouldn't.

java.util.AbstractMap seems to think that the sum of the hash codes of all entries is good enough. And that is good enough for me. Using a commutative operator gets rid of the need for ordering.

I would propose basically the same thing:

    hashcode(Matrix) = sum_row hashcode(row)
    hashcode(Vector) = sum_(i,v_i) hashcode(i) + hashcode(v_i)

For hashing integers, the integer itself is fine. For doubles, doubleToLongBits provides what we need (just xor the two halves of the resulting long).

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov-updated.patch,
> mahout-165-18nov.patch, mahout-165-trove.patch, MAHOUT-165-updated.patch,
> MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch,
> mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of
> Colt performs an order of magnitude better than OrderedIntDoubleMapping.
> For iteration it is 2x slower, though.
> Using Colt in SparseVector improved performance of canopy generation. For an
> experimental dataset, the current implementation takes 50 minutes. Using
> Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
> delay.
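The order-independent hash proposed in the comment above can be sketched roughly as follows. This is an illustration only, not Mahout code: the class name `OrderFreeHash`, its method names, and the use of `Map<Integer, Double>` as a stand-in for a sparse vector are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Mahout.
final class OrderFreeHash {

    // Hash a double by xor-ing the two 32-bit halves of its
    // doubleToLongBits pattern, as the comment suggests.
    public static int hashDouble(double v) {
        long bits = Double.doubleToLongBits(v);
        return (int) (bits ^ (bits >>> 32));
    }

    // hashcode(Vector) = sum over (i, v_i) of hashcode(i) + hashcode(v_i).
    // A Map<Integer, Double> stands in for a sparse vector here (assumption).
    public static int hashVector(Map<Integer, Double> entries) {
        int h = 0;
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            // The integer index is used as its own hash.
            h += e.getKey() + hashDouble(e.getValue());
        }
        return h;
    }

    // hashcode(Matrix) = sum over rows of hashcode(row).
    public static int hashMatrix(Iterable<? extends Map<Integer, Double>> rows) {
        int h = 0;
        for (Map<Integer, Double> row : rows) {
            h += hashVector(row);
        }
        return h;
    }

    public static void main(String[] args) {
        // The same entries inserted in two different orders.
        Map<Integer, Double> a = new LinkedHashMap<>();
        a.put(3, 1.5);
        a.put(7, -2.0);
        Map<Integer, Double> b = new LinkedHashMap<>();
        b.put(7, -2.0);
        b.put(3, 1.5);
        // int addition wraps on overflow but stays commutative,
        // so iteration order does not affect the result.
        System.out.println(hashVector(a) == hashVector(b)); // prints "true"
    }
}
```

Because the per-entry hashes are combined by addition, a hash-backed vector can compute this while iterating in whatever order its table happens to yield. One consequence of the scheme as written is that a matrix with its rows permuted hashes to the same value; that is permitted by the hashCode contract, as long as equals still compares the contents.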