[ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779412#action_12779412 ]
Sean Owen commented on MAHOUT-201:
----------------------------------

Taking a step back -- OrderedIntDoubleMapping was a simple band-aid and placeholder. I had thought we'd replace it, as an implementation for SparseVector, with Colt or something. In fact, it's my understanding that we're keeping our interfaces and reimplementing them completely in terms of Colt. So shouldn't the question rather be: how can we juice up the Colt implementation? There is no our-own implementation going forward.

> OrderedIntDoubleMapping / SparseVector is unnecessarily slow
> ------------------------------------------------------------
>
>                 Key: MAHOUT-201
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-201
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>             Fix For: 0.3
>
>         Attachments: MAHOUT-201.patch
>
>
> In the work on MAHOUT-165, I find that while Colt's sparse vector
> implementation is great from a hashing standpoint (it's memory-efficient and
> fast for random access), they don't provide anything like the
> OrderedIntDoublePair -- i.e. a vector implementation which is *not* fast for
> random access or out-of-order modification, but is minimally sized
> memory-wise and blazingly fast for doing read-only dot products and vector
> sums (where the latter is read-only on its inputs and creates a new output)
> with each other, and with DenseVectors.
>
> This line of thinking got me looking back at the current SparseVector
> implementation we have in Mahout, because it *is* based on an int[] and a
> double[]. Unfortunately, it's not at all optimized for the cases where it
> can outperform all other sparse impls:
> * it should override dot(Vector) and plus(Vector) to check whether the input
> is a DenseVector or a SparseVector (or, once we have an OpenIntDoubleMap
> implementation of SparseVector, that case as well), and do specialized
> operations there.
> * even when those particular methods aren't being used, the AllIterator and
> NonZeroIterator inner classes are very inefficient:
> ** minor things, like caching values.numMappings() and values.getIndices()
> in final instance variables in the Iterators
> ** the huge performance drain of Element.get():
> {code}
> public double get() {
>   return values.get(ind);
> }
> {code}
> which is implemented as a binary search on the index values array (the index
> of which was already known!), followed by the array lookup.
>
> This last point is probably the entire reason we've seen performance
> problems with SparseVector: it's in both the NonZeroIterator and the
> AllIterator, and so it turns any O(numNonZeroElements) operation into
> O(numNonZeroElements * log(numNonZeroElements)) (with some additional
> constant factors for too much indirection thrown in for good measure).
>
> Unless there's another JIRA ticket with a patch fixing this that I've
> missed, I can whip up a patch (I've got a similar implementation over in
> decomposer I can pull from, though mine is simpler because it's immutable,
> so it won't be just a copy-and-paste).
>
> We don't have any benchmarking code anywhere yet, do we? Is there a JIRA
> ticket open for that already?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
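The dot(Vector) specialization requested in the first bullet above amounts to a linear merge over two index-sorted arrays. Here is a minimal sketch of that idea; the parallel int[]/double[] arrays and method names are illustrative, not Mahout's actual SparseVector internals:

```java
// Hypothetical sketch of a merge-style sparse-sparse dot product over
// ordered (index, value) parallel arrays. One linear pass, no hashing
// and no binary searches: O(nnz(a) + nnz(b)).
public class SparseDotSketch {

  /** Dot product of two sparse vectors given as index-sorted parallel arrays. */
  static double dot(int[] aIdx, double[] aVal, int[] bIdx, double[] bVal) {
    double sum = 0.0;
    int i = 0, j = 0;
    while (i < aIdx.length && j < bIdx.length) {
      if (aIdx[i] == bIdx[j]) {
        sum += aVal[i++] * bVal[j++]; // indices match: accumulate product
      } else if (aIdx[i] < bIdx[j]) {
        i++;                          // a is behind: advance a
      } else {
        j++;                          // b is behind: advance b
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    int[] aIdx = {0, 3, 7};
    double[] aVal = {1.0, 2.0, 4.0};
    int[] bIdx = {3, 5, 7};
    double[] bVal = {10.0, 1.0, 0.5};
    // Overlapping indices are 3 and 7: 2.0*10.0 + 4.0*0.5 = 22.0
    System.out.println(dot(aIdx, aVal, bIdx, bVal));
  }
}
```

The dense-sparse case is simpler still: walk the sparse side's indices and read the dense array directly at each one, which is why dispatching on the argument's concrete type pays off.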
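The Element.get() drain described above disappears if iteration walks the mapping by array *position*, so each value read is a direct array access rather than a fresh binary search for an index the iterator already knows. A minimal sketch, again under illustrative (non-Mahout) naming:

```java
// Hypothetical sketch of positional iteration over an ordered
// int[]/double[] mapping: O(numMappings) total, with the mapping size
// cached in a local rather than re-queried per element.
public class PositionalIterationSketch {

  /** Sum of the stored (non-zero) values, walking the arrays by position. */
  static double sumNonZero(int[] indices, double[] values, int numMappings) {
    double sum = 0.0;
    for (int pos = 0; pos < numMappings; pos++) {
      sum += values[pos]; // direct lookup by position -- no search
    }
    return sum;
  }

  public static void main(String[] args) {
    int[] indices = {2, 5, 9};
    double[] values = {1.5, -0.5, 2.0};
    // 1.5 + (-0.5) + 2.0 = 3.0
    System.out.println(sumNonZero(indices, values, indices.length));
  }
}
```

By contrast, a get(index) that re-runs Arrays.binarySearch per element turns the same loop into the O(n log n) behavior the ticket complains about.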