[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751385#action_12751385 ]
Sean Owen commented on MAHOUT-165:
----------------------------------

Wait a sec, I thought we had concluded that we *cannot* use Trove. It was Colt that had a portion which was licensed acceptably.

Are you saying these errors occur before your change? I don't see these failures in head.

The first error: I can't tell you why it happens, but I can explain it more, if that's what you're asking. Zero and negative zero are actually different doubles. Primitive == treats them as equal, but Double.equals() and Double.compare() do not, so an equality check that goes through those (for example on boxed Double values) will fail. Somehow the computation has changed in your patch such that a result still ends up zero, but negative zero. One might say the test should not compare doubles for exact equality at all, but for equality within a small tolerance. But I don't see how this change should have affected this result, period, so it should probably be viewed as a problem with the patch, or Trove, or some funky interaction between them.

As for the other error, it sounds like Gson can't serialize/deserialize the Trove class correctly because of some circular reference among the instances. Dunno why that would be a problem.

But I think all this is moot since we can't use Trove?

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>             Fix For: 0.2
>
>         Attachments: mahout-165-trove.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitives hash map for indices and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of
> Colt performs an order of magnitude better than OrderedIntDoubleMapping.
> For iteration it is 2x slower, though.
> Using Colt in SparseVector improved the performance of canopy generation. For an
> experimental dataset, the current implementation takes 50 minutes. Using
> Colt reduces this duration to 19-20 minutes. That's roughly a 60% reduction in
> running time.
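
To make the zero versus negative zero point from the comment concrete, here is a minimal Java sketch. It is not taken from the Mahout tests or the patch: the class name NegativeZeroDemo, the helper names, and the 1e-9 tolerance are all made up for illustration. It shows that the two zeros compare equal as primitives but not as boxed values, and that a tolerance-based comparison sidesteps the difference.

```java
class NegativeZeroDemo {

    // Compares two doubles as boxed objects, the way an Object-based
    // equality assertion would. Double.equals() compares bit patterns.
    static boolean boxedEquals(double a, double b) {
        return Double.valueOf(a).equals(Double.valueOf(b));
    }

    // Tolerance-based comparison. The choice of epsilon is up to the
    // caller; 1e-9 below is an arbitrary value for illustration.
    static boolean approxEquals(double a, double b, double epsilon) {
        return Math.abs(a - b) <= epsilon;
    }

    public static void main(String[] args) {
        double zero = 0.0;
        // Multiplying zero by a negative number yields negative zero.
        double negZero = -1.0 * 0.0;

        System.out.println(zero == negZero);                   // true
        System.out.println(boxedEquals(zero, negZero));        // false
        System.out.println(approxEquals(zero, negZero, 1e-9)); // true
    }
}
```

If a test is hitting this, comparing with a tolerance (or at least with primitive ==) is one way to make it insensitive to the sign of zero, although that would not explain why the patch changed the result in the first place.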