Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/268#issuecomment-39034276
@yinxusen I want to see whether we can improve the performance. First of
all, the complexity should be `O(nnz)` instead of `O(n d)`. This is a little
bit tricky. Basically, we don't need to update zeros for a sparse vector.
Suppose we have the following column:
~~~
1.0
0.0
2.0
0.0
3.0
0.0
0.0
~~~
To apply the online formula for mean/variance, you would do an update at the
second row, which is a zero. That update is actually unnecessary, because the
summary statistics are invariant to the ordering of the rows. Imagine we have
this column re-arranged:
~~~
1.0
2.0
3.0
0.0
0.0
0.0
0.0
~~~
which still has the same summary statistics. So we can skip all zeros and do
the mean/variance updates based on the non-zero count instead of the global
count. After `aggregate`, we know we need to accumulate 4 zeros into the
column, which is a constant-time operation.
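To illustrate the idea (this is just a minimal sketch in Python, not the MLlib implementation): run an online (Welford-style) update over the non-zero entries only, then fold the `n - nnz` implicit zeros into the mean and variance in O(1) at the end.

~~~python
def mean_var_sparse(nonzeros, n):
    """Mean and population variance of a length-n column whose
    non-zero entries are given in `nonzeros`. Runs in O(nnz)."""
    nnz = 0
    mean = 0.0   # running mean over the non-zeros only
    m2 = 0.0     # running sum of squared deviations from that mean
    for x in nonzeros:
        nnz += 1
        delta = x - mean
        mean += delta / nnz
        m2 += delta * (x - mean)
    # Fold in the (n - nnz) implicit zeros in constant time.
    # The sum over all n entries is mean * nnz (zeros add nothing):
    global_mean = mean * nnz / n
    # Sum of squares over the non-zeros is m2 + nnz * mean^2;
    # again the zeros contribute nothing:
    sum_sq = m2 + nnz * mean * mean
    variance = sum_sq / n - global_mean * global_mean
    return global_mean, variance

# The example column above: non-zeros [1.0, 2.0, 3.0], length 7.
print(mean_var_sparse([1.0, 2.0, 3.0], 7))
~~~

The loop never touches a zero, and the final adjustment is O(1), so the whole pass is O(nnz) per column rather than O(n).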
I'm okay if you don't have enough time to make the change in this PR. Put a
TODO and we can fix it later.