SequentialAccessSparseVector should really *never* be used in a mutating way. You should use RandomAccessSparseVector if you're going to mutate, and then *freeze* the results in a SASV when you're done mutating it and you expect to be using it for only dot() and other read-only operations which scan over the whole set of nonzero entries.
This should be documented better, perhaps. On Wed, Apr 10, 2013 at 2:56 PM, Dan Filimon <[email protected]>wrote: > Good news and bad news! > > After weeks of searching, I finally found the bug that made my code run so > slowly. > I'm using SequenceFileDirIterable to read the centroids from disk and it > looks like it's creating SequentialAccessSparseVectors. > > Then, when assigning each point to its appropriate vector, we update the > centroid itself using assign() and a custom function that does a weighted > sum. > Herein lies the problem. > > In the existing code, assign() comes from AbstractVector and if the > function is not PLUS or PLUS_ABS, it does this: > > for (int i = 0; i < size; i++) { > setQuick(i, function.apply(getQuick(i), other.getQuick(i))); > } > > If "this" is a SequentialAccessSparseVector, getting is done in O(log size) > steps as is setting (except when a new element needs to be inserted which > can cause the two arrays in OrderedIntDoubleMapping to grow, which means a > copy of the vector has to be made). > > Even assuming copying is not an issue (although it really is, the centroids > grow more dense as more points are added), that is O(size log size) for one > assignment where size is the dimension of the vector (NOT the number of > mappings). > > In my case with vectors with 200K dimensions out of which at best 100 are > non-zero this was VERY painful. > > Anyway, here's the link to an experimental patch [1] (it adds NO more > tests/benchmarks). > And here's the link to the issue [2]. > > What should we do? > > [1] https://reviews.apache.org/r/10409/ > [2] https://issues.apache.org/jira/browse/MAHOUT-1190#comment-13628314 > -- -jake
