[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776748#action_12776748 ]
Jake Mannix commented on MAHOUT-165:
------------------------------------

*bump*

So what is the collective vision on this now? Shashi says the current impl (from this patch) is pretty slow, but I suspect that is really just some bugs in the IntDoubleMap-based impl (mentioned above). Has anyone looked at that other vector implementation? Once we have this patch working, we can compare it to Colt and see how they stand up.

The question remains whether we should consider Commons-Math as our underlying linear package. We can't really use cmath 2.0, because its API is missing some key features we need: iterators, for one thing, and it has only one sparse vector impl, based on the map, with no IntDoubleMapping-based one, which is pretty key for the performance of some sparse algs. It does have lots of great solvers, but I've been seeing a disturbing number of bugs fixed in their SVD implementation lately (fixing bugs is great, but finding them means we don't know how many more are in there), the impls are translated from Fortran, and the code is dense and hard for me to debug when a bug does pop up.

The other option is to pull in something like MTJ, but it depends on some f2j stuff, which is what is keeping Commons-Math from taking it, even though MTJ itself can be appropriately licensed.

Personally, while I hate reinventing wheels, I hate even more depending on libraries that don't quite have the API or the functionality we need, when the primitives involved are so core to much of what we want to do. So I'm more in favor of doing one of two things:

* Use our own primitives as is, and improve them, remembering that our core competency is *scaling*: optimizing for 1000-element dense double[] vectors and 10k double[][] matrices isn't where we care as much, and most linear libraries aren't in the same mode of thinking.
* Rip the unacceptably licensed parts out of Colt and use the rest for our linear routines. The hep.aida.* packages aren't needed and can be removed without losing the key functionality, and those packages are the only ones that aren't Apache-compatible.

Thoughts? If we're going to allow delegation to other implementations, we need to settle on at least a wrapper linear API, but even that involves a bunch of decisions about what belongs in that API - not all implementations can do everything, so this form of flexibility comes at a price.
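To make the tradeoff concrete, here is a minimal sketch of the sorted parallel-array idea behind an IntDoubleMapping-style structure (the class and its details are illustrative, not the actual Mahout code): random set pays an O(n) shift, but iteration over non-zeros is a tight, cache-friendly loop.

{code:java}
import java.util.Arrays;

// Hypothetical sketch of a sorted parallel-array int->double mapping,
// in the spirit of OrderedIntDoubleMapping. Not the real Mahout class.
public class ParallelArrayMapping {
  private int[] indices = new int[8];
  private double[] values = new double[8];
  private int size = 0;

  public double get(int index) {
    // O(log n) lookup via binary search over the sorted index array
    int pos = Arrays.binarySearch(indices, 0, size, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  public void set(int index, double value) {
    int pos = Arrays.binarySearch(indices, 0, size, index);
    if (pos >= 0) {
      values[pos] = value;
      return;
    }
    int insert = -(pos + 1);
    if (size == indices.length) {  // grow both arrays in lockstep
      indices = Arrays.copyOf(indices, size * 2);
      values = Arrays.copyOf(values, size * 2);
    }
    // O(n) shift to keep the arrays sorted: this is what makes
    // random set expensive in this layout
    System.arraycopy(indices, insert, indices, insert + 1, size - insert);
    System.arraycopy(values, insert, values, insert + 1, size - insert);
    indices[insert] = index;
    values[insert] = value;
    size++;
  }

  // Iteration over non-zeros is a sequential scan of packed arrays,
  // which is why this layout wins for sparse dot products and similar algs
  public double dot(double[] dense) {
    double sum = 0.0;
    for (int i = 0; i < size; i++) {
      sum += values[i] * dense[indices[i]];
    }
    return sum;
  }
}
{code}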
> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
> In SparseVector, we need a primitives hash map for indices and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of Colt performs an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> Using Colt in SparseVector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the delay.
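For reference, the kind of structure behind those Colt get/set numbers is an open-addressed primitive hash map, roughly like the sketch below (illustrative only; the real cern.colt.map.OpenIntDoubleHashMap also handles deletion, rehashing, and capacity management, all omitted here). Expected O(1) get/put with no boxing, but iteration has to scan every slot, occupied or not, which matches the 2x-slower iteration observed.

{code:java}
// Hypothetical sketch of a Colt/Trove-style open-addressed int->double map.
// Assumes keys (vector indices) are non-negative, so -1 can mark free slots.
public class OpenIntDoubleMap {
  private static final int FREE = -1;
  private final int[] keys;
  private final double[] vals;
  private final int mask;

  // capacity must be a power of two (not validated in this sketch)
  public OpenIntDoubleMap(int capacityPow2) {
    keys = new int[capacityPow2];
    vals = new double[capacityPow2];
    java.util.Arrays.fill(keys, FREE);
    mask = capacityPow2 - 1;
  }

  private int slot(int key) {
    int h = key * 0x9E3779B9;            // cheap integer mixing
    int i = (h ^ (h >>> 16)) & mask;
    while (keys[i] != FREE && keys[i] != key) {
      i = (i + 1) & mask;                // linear probing
    }
    return i;
  }

  public double get(int key) {
    int i = slot(key);
    return keys[i] == key ? vals[i] : 0.0;  // expected O(1), no boxing
  }

  public void put(int key, double value) {
    int i = slot(key);
    keys[i] = key;                       // (resize on high load omitted)
    vals[i] = value;
  }

  // Iteration must visit every slot, including empty ones, which is
  // why iteration over non-zeros trails the packed parallel-array layout
  public double sum() {
    double s = 0.0;
    for (int i = 0; i <= mask; i++) {
      if (keys[i] != FREE) {
        s += vals[i];
      }
    }
    return s;
  }
}
{code}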