[ https://issues.apache.org/jira/browse/MAHOUT-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779403#action_12779403 ]
Shashikant Kore commented on MAHOUT-201:
----------------------------------------

Jake, Colt also provides fast iteration over key-value pairs. I think the IntDoublePairIterator in this patch is the same as Colt's IntDoubleProcedure. Check the private class AddToVector, which implements IntDoubleProcedure, in my patch on MAHOUT-165. Is that something you want?

> OrderedIntDoubleMapping / SparseVector is unnecessarily slow
> ------------------------------------------------------------
>
>                 Key: MAHOUT-201
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-201
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Jake Mannix
>             Fix For: 0.3
>
>         Attachments: MAHOUT-201.patch
>
>
> In the work on MAHOUT-165, I find that while Colt's sparse vector
> implementation is great from a hashing standpoint (it's memory-efficient and
> fast for random access), it doesn't provide anything like
> OrderedIntDoublePair - i.e. a vector implementation which is *not* fast for
> random access or out-of-order modification, but is minimally sized
> memory-wise and blazingly fast for read-only dot products and vector sums
> (where the latter is read-only on its inputs and creates new output) with
> each other and with DenseVectors.
> This line of thinking got me looking back at the current SparseVector
> implementation we have in Mahout, because it *is* based on an int[] and a
> double[]. Unfortunately, it's not at all optimized for the cases where it
> can outperform all other sparse implementations:
> * it should override dot(Vector) and plus(Vector) to check whether the input
> is a DenseVector or a SparseVector (or, once we have an OpenIntDoubleMap
> implementation of SparseVector, that case as well), and do specialized
> operations there.
> * even when those particular methods aren't being used, the AllIterator and
> NonZeroIterator inner classes are very inefficient:
> ** minor things, like caching values.numMappings() and values.getIndices()
> in final instance variables in the Iterators
> ** the huge performance drain of Element.get():
> {code}
> public double get() {
>   return values.get(ind);
> }
> {code}
> which is implemented as a binary search on the index values array (the index
> of which was already known!) followed by the array lookup.
> This last point is probably the entire reason why we've seen performance
> problems with SparseVector: it's in both the NonZeroIterator and the
> AllIterator, so it turns any O(numNonZeroElements) operation into an
> O(numNonZeroElements * log(numNonZeroElements)) one (with some additional
> constant factors of indirection thrown in for good measure).
> Unless there is another JIRA ticket with a patch fixing this which I didn't
> notice, I can whip up a patch (I've got a similar implementation over in
> decomposer I can pull stuff out of, although mine is simpler because it is
> immutable, so it's not just a copy and paste).
> We don't have any benchmarking code anywhere yet, do we? Is there a JIRA
> ticket open for that already?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
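[Editorial sketch] The procedure-style iteration Shashikant describes can be sketched as follows. This is a minimal, hypothetical example: the interface mirrors Colt's IntDoubleProcedure callback pattern, but the class and field names here are illustrative and not Mahout's or Colt's actual API. The point is that walking the parallel int[]/double[] arrays directly visits each mapping in O(1), with no per-element binary search.

```java
// Callback interface modeled on Colt's IntDoubleProcedure (illustrative, not the real class).
interface IntDoubleProcedure {
    // Return false to stop iteration early.
    boolean apply(int key, double value);
}

// Minimal stand-in for an ordered sparse mapping backed by parallel arrays.
class OrderedIntDoubleMapping {
    private final int[] indices;
    private final double[] values;
    private final int numMappings;

    OrderedIntDoubleMapping(int[] indices, double[] values, int numMappings) {
        this.indices = indices;
        this.values = values;
        this.numMappings = numMappings;
    }

    // Visit every stored (index, value) pair with direct array access:
    // O(numMappings) total, no binary search per element.
    void forEachPair(IntDoubleProcedure proc) {
        for (int i = 0; i < numMappings; i++) {
            if (!proc.apply(indices[i], values[i])) {
                return;
            }
        }
    }
}

public class AddToVectorDemo {
    public static void main(String[] args) {
        OrderedIntDoubleMapping m = new OrderedIntDoubleMapping(
            new int[] {0, 3, 7}, new double[] {1.0, 2.0, 3.0}, 3);
        final double[] sum = {0.0};
        // An AddToVector-style procedure would accumulate into a target vector;
        // here we just sum the values.
        m.forEachPair((k, v) -> { sum[0] += v; return true; });
        System.out.println(sum[0]); // 6.0
    }
}
```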
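[Editorial sketch] Jake's suggestion of a specialized sparse-sparse dot(Vector) can likewise be sketched. With two ordered (index, value) array pairs, a merge-style walk computes the dot product in O(n1 + n2) using direct array access, avoiding the per-element binary search that the quoted Element.get() incurs. Method and parameter names are hypothetical, not Mahout's actual signatures.

```java
public class SparseDot {
    // Merge-style dot product of two sparse vectors stored as ordered
    // parallel arrays: advance whichever cursor has the smaller index,
    // multiply only where indices match. O(aLen + bLen), no binary search.
    static double dot(int[] aIdx, double[] aVal, int aLen,
                      int[] bIdx, double[] bVal, int bLen) {
        double result = 0.0;
        int i = 0, j = 0;
        while (i < aLen && j < bLen) {
            if (aIdx[i] == bIdx[j]) {
                result += aVal[i] * bVal[j];
                i++;
                j++;
            } else if (aIdx[i] < bIdx[j]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Only indices 2 and 5 overlap: 2*4 + 3*5 = 23.0
        System.out.println(dot(new int[] {0, 2, 5}, new double[] {1.0, 2.0, 3.0}, 3,
                               new int[] {2, 5, 9}, new double[] {4.0, 5.0, 6.0}, 3));
    }
}
```

A real dot(Vector) override would first check the argument's runtime type (dense vs. sparse) and fall back to the generic path for dense inputs; the merge walk above is the sparse-sparse fast path.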