[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779411#action_12779411 ]
Sean Owen commented on MAHOUT-165:
----------------------------------

+1 to making this a Mahout module. There's not much difference in making this a separate project that's Mahout-related, other than that it is more work. mahout-colt shouldn't depend on mahout-core or anything, so it can still be used independently.

I say commit this (with proper license attribution at the moment) so we can move forward! Sounds like Shashi's patch should then follow, using mahout-colt.

The unit test failures sound like either an issue in the patch, or in the Vector.hashCode() implementation. Vector elements are ordered, so equals() must pay attention to order, therefore so must hashCode(). If that's not true it's a bug. If it's a bug, fix it in your patch.

With these two, can we finally commit and close this monster issue?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-18nov.patch,
> mahout-165-trove.patch, MAHOUT-165-updated.patch,
> MAHOUT-165-with-colt-module.patch, MAHOUT-165-with-colt.patch,
> mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitives hash map for indices and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of
> Colt performs an order of magnitude better than OrderedIntDoubleMapping.
> For iteration it is 2x slower, though.
> Using Colt in SparseVector improved the performance of canopy generation. For an
> experimental dataset, the current implementation takes 50 minutes. Using
> Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
> duration.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
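
On the equals()/hashCode() point in the comment above: one way a hash-backed sparse vector can keep the two consistent is to make both depend on (index, value) pairs rather than on the order in which the backing store happens to iterate. The following is a minimal sketch under that assumption. The class name SparseVectorSketch and its methods are hypothetical and are not Mahout's SparseVector or Colt's API; a LinkedHashMap stands in for a primitive hash map so that iteration order visibly differs from index order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Mahout's SparseVector.
final class SparseVectorSketch {
  private final int cardinality;
  // LinkedHashMap iterates in insertion order, standing in for a hash-backed
  // store whose iteration order is unrelated to index order.
  private final Map<Integer, Double> values = new LinkedHashMap<>();

  SparseVectorSketch(int cardinality) {
    this.cardinality = cardinality;
  }

  void set(int index, double value) {
    if (index < 0 || index >= cardinality) {
      throw new IndexOutOfBoundsException("index " + index);
    }
    if (value == 0.0) {
      values.remove(index); // zeros are implicit, so never store them
    } else {
      values.put(index, value);
    }
  }

  double get(int index) {
    Double v = values.get(index);
    return v == null ? 0.0 : v;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof SparseVectorSketch)) {
      return false;
    }
    SparseVectorSketch other = (SparseVectorSketch) o;
    // Map.equals compares entries regardless of iteration order, so equality
    // depends on which index holds which value, not on insertion order.
    return cardinality == other.cardinality && values.equals(other.values);
  }

  @Override
  public int hashCode() {
    // Each term mixes the index into the value's hash, so moving a value to a
    // different index changes its term. Terms are combined with addition,
    // which is commutative, so the storage iteration order cannot matter.
    int result = cardinality;
    for (Map.Entry<Integer, Double> e : values.entrySet()) {
      result += (31 * e.getKey() + 17) * Double.hashCode(e.getValue());
    }
    return result;
  }
}
```

With this sketch, two vectors filled with the same (index, value) pairs in different insertion orders compare equal and share a hash code, while swapping the values held at two indices makes them unequal. Whether the failing tests in the patch stem from this kind of iteration-order dependence is an assumption here, not something the thread confirms.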