[GitHub] spark pull request: [WIP] [SPARK-1328] Add vector statistics

mengxr Fri, 28 Mar 2014 23:44:29 -0700

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/268#issuecomment-38988446
  
    @yinxusen Thanks for working on this! I don't think row statistics are 
important, because each row mixes values from different features. For column 
statistics, instead of implementing each statistic separately, we can compute 
all common statistics like (n, nnz, mean, variance, max, min) in a single job. 
Computing them together adds little overhead.
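
    To illustrate the single-job idea, here is a minimal sketch in plain Python, not the Spark or MLlib API. It assumes each row is a hypothetical `(indices, values)` pair holding only the nonzero entries; `ColumnStats` and `column_stats` are illustrative names that do not come from the pull request. One pass accumulates per-column counts, sums, sums of squares, and extrema, and the implicit zeros are accounted for at the end.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple

# Hypothetical sparse row: (indices, values) of the nonzero entries only.
SparseRow = Tuple[Sequence[int], Sequence[float]]


@dataclass
class ColumnStats:
    n: int                 # number of rows seen
    nnz: List[int]         # nonzero count per column
    mean: List[float]
    variance: List[float]  # sample variance (divides by n - 1)
    max: List[float]
    min: List[float]


def column_stats(rows: Iterable[SparseRow], num_cols: int) -> ColumnStats:
    n = 0
    nnz = [0] * num_cols
    total = [0.0] * num_cols
    total_sq = [0.0] * num_cols
    cur_max = [float("-inf")] * num_cols
    cur_min = [float("inf")] * num_cols

    # Single pass: only the stored nonzeros are touched.
    for indices, values in rows:
        n += 1
        for j, v in zip(indices, values):
            if v == 0.0:
                continue  # ignore explicitly stored zeros
            nnz[j] += 1
            total[j] += v
            total_sq[j] += v * v
            cur_max[j] = max(cur_max[j], v)
            cur_min[j] = min(cur_min[j], v)

    mean: List[float] = []
    variance: List[float] = []
    for j in range(num_cols):
        # A column with fewer nonzeros than rows contains implicit zeros.
        if nnz[j] < n:
            cur_max[j] = max(cur_max[j], 0.0)
            cur_min[j] = min(cur_min[j], 0.0)
        m = total[j] / n if n else 0.0
        mean.append(m)
        # Sum-of-squares formula: simple, but less numerically stable
        # than a Welford-style update; clamp tiny negative results.
        var = (total_sq[j] - n * m * m) / (n - 1) if n > 1 else 0.0
        variance.append(max(var, 0.0))

    return ColumnStats(n, nnz, mean, variance, cur_max, cur_min)
```

    In a distributed setting the same accumulators could be kept per partition and merged by adding the counts and sums and taking the max/min of the extrema, which is what lets everything fit in one job.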
    
    Btw, `Vector.toArray` is an expensive operation for `SparseVector`. You 
should use breeze's axpy to aggregate vectors.
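
    For context, axpy is the BLAS-style update `y += a * x`. The sketch below is plain Python rather than breeze, and it assumes a hypothetical `(indices, values)` representation for a sparse vector; `sparse_axpy` and `sum_rows` are illustrative names only. It shows why the approach is cheaper: each sparse vector is folded into one dense accumulator by visiting only its nonzeros, so no dense copy of the vector is ever created.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical sparse vector: (indices, values) of the nonzero entries only.
SparseVec = Tuple[Sequence[int], Sequence[float]]


def sparse_axpy(a: float, x: SparseVec, y: List[float]) -> None:
    """In-place y += a * x, touching only the nonzeros of x."""
    indices, values = x
    for j, v in zip(indices, values):
        y[j] += a * v


def sum_rows(rows: Iterable[SparseVec], num_cols: int) -> List[float]:
    """Column sums built with one dense accumulator and repeated axpy."""
    acc = [0.0] * num_cols
    for row in rows:
        sparse_axpy(1.0, row, acc)
    return acc
```

    The cost per row is proportional to its number of nonzeros, whereas converting each row to a dense array first would cost time and memory proportional to the full vector length.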



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
