Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/268#issuecomment-39034276
@yinxusen I want to see whether we can improve the performance. First of
all, the complexity should be `O(nnz)` instead of `O(n d)`. This is a little
bit tricky. Basically, we don't need to update zeros for a sparse vector.
Suppose we have the following column:
~~~
1.0
0.0
2.0
0.0
3.0
0.0
0.0
~~~
To apply the online formula for mean/variance, you would do an update at the
second row, which is a zero. That update is actually unnecessary, because the
summary statistics are invariant to the ordering of the rows. Imagine we have
this column re-arranged:
~~~
1.0
2.0
3.0
0.0
0.0
0.0
0.0
~~~
which still has the same summary statistics. So we can skip all zeros and do
the mean/variance updates based on the non-zero count instead of the global
count. After `aggregate`, we know we need to accumulate 4 zeros into the
column, which is a constant-time operation.
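To illustrate the idea (this is just a minimal sketch in Python, not the MLlib implementation): run an online (Welford-style) update over the non-zero entries only, then fold the `n - nnz` implicit zeros into the mean and variance in O(1) at the end.

~~~python
def mean_var_sparse(nonzeros, n):
    """Mean and population variance of a length-n column whose
    non-zero entries are given in `nonzeros`. Runs in O(nnz)."""
    nnz = 0
    mean = 0.0   # running mean over the non-zeros only
    m2 = 0.0     # running sum of squared deviations from that mean
    for x in nonzeros:
        nnz += 1
        delta = x - mean
        mean += delta / nnz
        m2 += delta * (x - mean)
    # Fold in the (n - nnz) implicit zeros in constant time.
    # The sum over all n entries is mean * nnz (zeros add nothing):
    global_mean = mean * nnz / n
    # Sum of squares over the non-zeros is m2 + nnz * mean^2;
    # again the zeros contribute nothing:
    sum_sq = m2 + nnz * mean * mean
    variance = sum_sq / n - global_mean * global_mean
    return global_mean, variance

# The example column above: non-zeros [1.0, 2.0, 3.0], length 7.
print(mean_var_sparse([1.0, 2.0, 3.0], 7))
~~~

The loop never touches a zero, and the final adjustment is O(1), so the whole pass is O(nnz) per column rather than O(n).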
I'm okay if you don't have enough time to make the change in this PR. Put a
TODO and we can fix it later.