[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761021#action_12761021 ]
Ted Dunning commented on MAHOUT-165:
------------------------------------

Thanks Jake, that could be very helpful.

The throwing of "Impossible confusion" is done in situations where an impossible condition has been detected. For instance, since hash tables are resized when they become partially filled, it should be impossible for the search loop to exit without finding an empty cell or a match. When programming, I have difficulty pronouncing "should", so I try to detect the situation and signal it with an unchecked exception. I usually define something like "ImpossibleConditionException", but didn't in this case. I use an unchecked exception because it is clear that the application is not going to be able to recover from a situation that I don't think could occur.

I left the hard-coding of one option or the other in place because I could see my patch extending into everything everywhere and wanted to limit the scope of the change. You are right that we need to think about how that works. In most cases, I think that hard-coding is fine, just as hard-coding the use of an ArrayList in some application is not subject to user override. There are a few cases where this isn't true, but I think that usually means that the vector or matrix should be passed in. The use of like() may also be indicated.

> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of Colt performs an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> Using Colt in SparseVector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the delay.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
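
For readers following the "Impossible confusion" discussion in the comment above, here is a minimal sketch of the pattern it describes: an open-addressing table that is resized before it fills, with an unchecked exception thrown if the probe loop ever exits without finding an empty cell or a match. This is not the code from any of the attached patches. The class names (IntDoubleProbeMap, ImpossibleConditionException), the half-full resize threshold, the hash mixing, and the sentinel key are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical unchecked exception, named after the one mentioned in the comment. */
class ImpossibleConditionException extends IllegalStateException {
    ImpossibleConditionException(String message) {
        super(message);
    }
}

/** Illustrative int-to-double map using linear probing; not Mahout's or Colt's implementation. */
class IntDoubleProbeMap {
    // Assumption: one key value is reserved as the "empty cell" marker.
    private static final int EMPTY = Integer.MIN_VALUE;

    private int[] keys = new int[16];      // capacity is always a power of two
    private double[] values = new double[16];
    private int size;

    IntDoubleProbeMap() {
        Arrays.fill(keys, EMPTY);
    }

    /** Returns the slot holding key, or the first empty slot on its probe path. */
    private int findSlot(int key) {
        int mask = keys.length - 1;
        int h = key * 0x9E3779B9;
        int start = (h ^ (h >>> 16)) & mask;
        for (int i = 0; i < keys.length; i++) {
            int slot = (start + i) & mask;
            if (keys[slot] == EMPTY || keys[slot] == key) {
                return slot;
            }
        }
        // The table is kept at most half full, so this line should be unreachable.
        throw new ImpossibleConditionException(
            "Impossible confusion: probe found neither an empty cell nor a match");
    }

    void put(int key, double value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("key is reserved as the empty marker");
        }
        if ((size + 1) * 2 > keys.length) {
            resize(); // grow before the table becomes more than half full
        }
        int slot = findSlot(key);
        if (keys[slot] == EMPTY) {
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    /** Returns 0.0 for absent keys, as a sparse vector would. */
    double get(int key) {
        if (key == EMPTY) {
            return 0.0;
        }
        int slot = findSlot(key);
        return keys[slot] == key ? values[slot] : 0.0;
    }

    int size() {
        return size;
    }

    private void resize() {
        int[] oldKeys = keys;
        double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY);
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                int slot = findSlot(oldKeys[i]);
                keys[slot] = oldKeys[i];
                values[slot] = oldValues[i];
            }
        }
    }
}
```

The exception is unchecked because a caller has no sensible way to recover from a broken table invariant; making it checked would only push a meaningless catch clause onto every user of get and put.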