Good news and bad news!

After weeks of searching, I finally found the bug that made my code run so
slowly.
I'm using SequenceFileDirIterable to read the centroids from disk and it
looks like it's creating SequentialAccessSparseVectors.
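
For context, the read path looks roughly like this (the path, key type and
PathType here are my assumptions for illustration, not copied from the
actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
Path centroidsPath = new Path("centroids");  // hypothetical path
for (Pair<IntWritable, VectorWritable> record :
    new SequenceFileDirIterable<IntWritable, VectorWritable>(
        centroidsPath, PathType.LIST, conf)) {
  // each value deserializes to a Vector; these come back as
  // SequentialAccessSparseVectors
  Vector centroid = record.getSecond().get();
  // ... cluster points against centroid ...
}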

Then, when assigning each point to its appropriate centroid, we update the
centroid itself using assign() and a custom function that computes a
weighted sum.
Herein lies the problem.

In the existing code, assign() comes from AbstractVector, and if the
function is not PLUS or PLUS_ABS, it does this:

// iterates over every dimension, not just the nonzero mappings
for (int i = 0; i < size; i++) {
  setQuick(i, function.apply(getQuick(i), other.getQuick(i)));
}

If "this" is a SequentialAccessSparseVector, getting is done in O(log size)
steps as is setting (except when a new element needs to be inserted which
can cause the two arrays in OrderedIntDoubleMapping to grow, which means a
copy of the vector has to be made).
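
For comparison, here's a minimal sketch of a mappings-only update (my own
illustration, NOT the actual patch; it is only valid for functions f where
f(x, 0) == x, so elements that are zero in the other vector can be skipped):

import java.util.Iterator;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.DoubleDoubleFunction;

// Sketch only: assumes function.apply(x, 0) == x for all x, so indices
// where "other" is zero can be skipped entirely.
static Vector assignNonZeros(Vector target, Vector other,
                             DoubleDoubleFunction function) {
  Iterator<Vector.Element> it = other.iterateNonZero();
  while (it.hasNext()) {
    Vector.Element e = it.next();
    // O(log #mappings) get/set per touched index, instead of touching
    // all "size" dimensions
    target.setQuick(e.index(),
        function.apply(target.getQuick(e.index()), e.get()));
  }
  return target;
}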

Even assuming copying is not an issue (although it really is: the
centroids grow denser as more points are added), that's O(size log size)
for one assignment, where size is the dimension of the vector (NOT the
number of nonzero mappings).

In my case, with vectors of 200K dimensions of which at most 100 are
non-zero, this was VERY painful.
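
Back-of-envelope: the dense loop above does 200,000 get/set pairs per
assignment, while a mappings-only loop would do on the order of 100, a
~2000x difference before even counting the array copies.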

Anyway, here's the link to an experimental patch [1] (it adds NO new
tests/benchmarks).
And here's the link to the issue [2].

What should we do?

[1] https://reviews.apache.org/r/10409/
[2] https://issues.apache.org/jira/browse/MAHOUT-1190#comment-13628314
