[ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779403#action_12779403 ]
Shashikant Kore commented on MAHOUT-201:
----------------------------------------

Jake, Colt also provides fast iteration over key-value pairs. I think the IntDoublePairIterator in this patch is the same as Colt's IntDoubleProcedure. Check the private class AddToVector, which implements IntDoubleProcedure, in my patch on MAHOUT-165. Is that something you want?

> OrderedIntDoubleMapping / SparseVector is unnecessarily slow
> ------------------------------------------------------------
>
>                 Key: MAHOUT-201
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-201
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>             Fix For: 0.3
>
>         Attachments: MAHOUT-201.patch
>
>
> In the work on MAHOUT-165, I find that while Colt's sparse vector
> implementation is great from a hashing standpoint (it's memory-efficient and
> fast for random access), it doesn't provide anything like
> OrderedIntDoublePair - i.e. a vector implementation which is *not* fast for
> random access or out-of-order modification, but is minimally sized
> memory-wise and blazingly fast for read-only dot products and vector sums
> (where the latter is read-only on its inputs and creates new output) with
> each other and with DenseVectors.
> This line of thinking got me looking back at the current SparseVector
> implementation we have in Mahout, because it *is* based on an int[] and a
> double[]. Unfortunately, it's not at all optimized for the cases where it
> can outperform all other sparse implementations:
> * it should override dot(Vector) and plus(Vector) to check whether the input
> is a DenseVector or a SparseVector (or, once we have an OpenIntDoubleMap
> implementation of SparseVector, that case as well), and do specialized
> operations there.
> * even when those particular methods aren't being used, the AllIterator and
> NonZeroIterator inner classes are very inefficient:
> ** minor things, like caching values.numMappings() and values.getIndices()
> in final instance variables in the Iterators
> ** the huge performance drain of Element.get():
> {code}
> public double get() {
>   return values.get(ind);
> }
> {code}
> which is implemented as a binary search on the index values array (the index
> of which was already known!) followed by the array lookup.
> This last point is probably the entire reason why we've seen performance
> problems with SparseVector: it's in both the NonZeroIterator and the
> AllIterator, so it turns any O(numNonZeroElements) operation into an
> O(numNonZeroElements * log(numNonZeroElements)) one (with some additional
> constant factors of indirection thrown in for good measure).
> Unless there is another JIRA ticket with a patch fixing this which I didn't
> notice, I can whip up a patch (I've got a similar implementation over in
> decomposer I can pull stuff out of, although mine is simpler because it is
> immutable, so it's not just a copy and paste).
> We don't have any benchmarking code anywhere yet, do we? Is there a JIRA
> ticket open for that already?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
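[Editorial sketch] The procedure-style iteration Shashikant describes can be sketched as follows. This is a minimal, hypothetical example: the interface mirrors Colt's IntDoubleProcedure callback pattern, but the class and field names here are illustrative and not Mahout's or Colt's actual API. The point is that walking the parallel int[]/double[] arrays directly visits each mapping in O(1), with no per-element binary search.

```java
// Callback interface modeled on Colt's IntDoubleProcedure (illustrative, not the real class).
interface IntDoubleProcedure {
    // Return false to stop iteration early.
    boolean apply(int key, double value);
}

// Minimal stand-in for an ordered sparse mapping backed by parallel arrays.
class OrderedIntDoubleMapping {
    private final int[] indices;
    private final double[] values;
    private final int numMappings;

    OrderedIntDoubleMapping(int[] indices, double[] values, int numMappings) {
        this.indices = indices;
        this.values = values;
        this.numMappings = numMappings;
    }

    // Visit every stored (index, value) pair with direct array access:
    // O(numMappings) total, no binary search per element.
    void forEachPair(IntDoubleProcedure proc) {
        for (int i = 0; i < numMappings; i++) {
            if (!proc.apply(indices[i], values[i])) {
                return;
            }
        }
    }
}

public class AddToVectorDemo {
    public static void main(String[] args) {
        OrderedIntDoubleMapping m = new OrderedIntDoubleMapping(
            new int[] {0, 3, 7}, new double[] {1.0, 2.0, 3.0}, 3);
        final double[] sum = {0.0};
        // An AddToVector-style procedure would accumulate into a target vector;
        // here we just sum the values.
        m.forEachPair((k, v) -> { sum[0] += v; return true; });
        System.out.println(sum[0]); // 6.0
    }
}
```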
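[Editorial sketch] Jake's suggestion of a specialized sparse-sparse dot(Vector) can likewise be sketched. With two ordered (index, value) array pairs, a merge-style walk computes the dot product in O(n1 + n2) using direct array access, avoiding the per-element binary search that the quoted Element.get() incurs. Method and parameter names are hypothetical, not Mahout's actual signatures.

```java
public class SparseDot {
    // Merge-style dot product of two sparse vectors stored as ordered
    // parallel arrays: advance whichever cursor has the smaller index,
    // multiply only where indices match. O(aLen + bLen), no binary search.
    static double dot(int[] aIdx, double[] aVal, int aLen,
                      int[] bIdx, double[] bVal, int bLen) {
        double result = 0.0;
        int i = 0, j = 0;
        while (i < aLen && j < bLen) {
            if (aIdx[i] == bIdx[j]) {
                result += aVal[i] * bVal[j];
                i++;
                j++;
            } else if (aIdx[i] < bIdx[j]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Only indices 2 and 5 overlap: 2*4 + 3*5 = 23.0
        System.out.println(dot(new int[] {0, 2, 5}, new double[] {1.0, 2.0, 3.0}, 3,
                               new int[] {2, 5, 9}, new double[] {4.0, 5.0, 6.0}, 3));
    }
}
```

A real dot(Vector) override would first check the argument's runtime type (dense vs. sparse) and fall back to the generic path for dense inputs; the merge walk above is the sparse-sparse fast path.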