[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Drew Farris updated MAHOUT-165:
-------------------------------

    Attachment: MAHOUT-165-with-colt-module.patch

As discussed earlier, I've taken Jake's patch and moved all of the colt source into a new module under mahout.

After this patch is applied, colt can be built from the top-level package by adding -Pcolt to the maven invocation, e.g. {{mvn clean install -Pcolt}}. The result is {{mahout-colt-0.3-SNAPSHOT.jar}}, which in turn could be added as a dependency in core, but I'll leave that for another patch once there's stuff in core that actually uses colt.

I've removed the concurrent dependency from the core pom added by Jake's patch. Other than that and the relocation, everything else is the same as what Jake provided.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch,
> MAHOUT-165-updated.patch, MAHOUT-165-with-colt-module.patch,
> MAHOUT-165-with-colt.patch, mahout-165.patch, MAHOUT-165.patch,
> mahout-165.patch
>
>
> In SparseVector, we need a primitives hash map for indices and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of
> Colt performs an order of magnitude better than OrderedIntDoubleMapping.
> For iteration it is 2x slower, though.
> Using Colt in SparseVector improved the performance of canopy generation. For
> an experimental dataset, the current implementation takes 50 minutes. Using
> Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the
> delay.
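
For anyone following along who hasn't looked at what a "primitives hash map" buys here, a minimal sketch follows. It assumes open addressing with linear probing over parallel {{int[]}} / {{double[]}} arrays, so neither keys nor values are boxed. This is not Colt's code and not OrderedIntDoubleMapping; the class name {{IntDoubleMapSketch}}, its methods, the sentinel key and the 0.5 load factor are all hypothetical choices made for illustration.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not Colt's or Mahout's implementation. */
final class IntDoubleMapSketch {
    // Sentinel marking an empty slot. Assumption: this key is never stored.
    private static final int FREE = Integer.MIN_VALUE;

    private int[] keys;       // vector indices
    private double[] values;  // vector values, parallel to keys
    private int size;

    IntDoubleMapSketch(int expectedSize) {
        int cap = 16;
        while (cap < expectedSize * 2) {
            cap <<= 1; // power-of-two capacity, load factor at most 0.5
        }
        keys = new int[cap];
        values = new double[cap];
        Arrays.fill(keys, FREE);
    }

    /** Returns the slot holding key, or the first free slot on its probe path. */
    private int slot(int key) {
        int mask = keys.length - 1;
        int h = key * 0x9E3779B9;          // cheap multiplicative scramble
        int i = (h ^ (h >>> 16)) & mask;
        while (keys[i] != FREE && keys[i] != key) {
            i = (i + 1) & mask;            // linear probing
        }
        return i;
    }

    void set(int key, double value) {
        if (key == FREE) {
            throw new IllegalArgumentException("reserved key");
        }
        if ((size + 1) * 2 > keys.length) {
            grow();
        }
        int i = slot(key);
        if (keys[i] == FREE) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    /** Absent indices read as 0.0, as they would in a sparse vector. */
    double get(int key) {
        int i = slot(key);
        return keys[i] == key ? values[i] : 0.0;
    }

    int size() {
        return size;
    }

    /** Doubles the tables and reinserts every stored entry. */
    private void grow() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[oldKeys.length * 2];
        Arrays.fill(keys, FREE);
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != FREE) {
                set(oldKeys[j], oldValues[j]);
            }
        }
    }
}
```

With this layout, get/set take expected constant time, whereas a structure that keeps its indices sorted has to search on lookup and shift elements on insert. Walking the entries in index order is where a sorted layout has the edge, since a hash table has to scan its slots, free ones included, and yields entries unordered. The sketch leaves out removal, which under linear probing needs tombstones or re-insertion of the following cluster.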