SequentialAccessSparseVector should really *never* be used in a
mutating way.  You should use RandomAccessSparseVector if you're
going to mutate, and then *freeze* the results in a SASV when you're
done mutating it and you expect to be using it for only dot() and other
read-only operations which scan over the whole set of nonzero entries.

This should be documented better, perhaps.


On Wed, Apr 10, 2013 at 2:56 PM, Dan Filimon <[email protected]>wrote:

> Good news and bad news!
>
> After weeks of searching, I finally found the bug that made my code run so
> slowly.
> I'm using SequenceFileDirIterable to read the centroids from disk and it
> looks like it's creating SequentialAccessSparseVectors.
>
> Then, when assigning each point to its appropriate vector, we update the
> centroid itself using assign() and a custom function that does a weighted
> sum.
> Herein lies the problem.
>
> In the existing code, assign() comes from AbstractVector and if the
> function is not PLUS or PLUS_ABS, it does this:
>
> for (int i = 0; i < size; i++) {
>   setQuick(i, function.apply(getQuick(i), other.getQuick(i)));
> }
>
> If "this" is a SequentialAccessSparseVector, getting is done in O(log size)
> steps as is setting (except when a new element needs to be inserted which
> can cause the two arrays in OrderedIntDoubleMapping to grow, which means a
> copy of the vector has to be made).
>
> Even assuming copying is not an issue (although it really is, the centroids
> grow more dense as more points are added), that is O(size log size) for one
> assignment where size is the dimension of the vector (NOT the number of
> mappings).
>
> In my case with vectors with 200K dimensions out of which at best 100 are
> non-zero this was VERY painful.
>
> Anyway, here's the link to an experimental patch [1] (it adds NO more
> tests/benchmarks).
> And here's the link to the issue [2].
>
> What should we do?
>
> [1] https://reviews.apache.org/r/10409/
> [2] https://issues.apache.org/jira/browse/MAHOUT-1190#comment-13628314
>



-- 

  -jake

Reply via email to