[ 
https://issues.apache.org/jira/browse/SPARK-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-1969:
---------------------------

    Description: 
Basically, it moves the private ColumnStatisticsAggregator class out of RowMatrix
and exposes it as a publicly available DeveloperApi.

Changes:
1) Moved the trait from 
org.apache.spark.mllib.stat.MultivariateStatisticalSummary to 
org.apache.spark.mllib.stats.Summarizer 
2) Moved the private implementation from
org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to
org.apache.spark.mllib.stats.OnlineSummarizer
3) Added API documentation for OnlineSummarizer
4) Added unit tests for OnlineSummarizer
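
As a rough illustration of what a column-wise online summarizer of this kind
does, here is a minimal Python sketch. The class and method names
(ColumnSummarizer, add, merge) are hypothetical and are not the Spark API; the
sketch only assumes that per-column count, mean, variance, min, and max are
tracked incrementally and that partial summaries from different partitions can
be combined.

```python
import math


class ColumnSummarizer:
    """Hypothetical per-column online summary: mean, variance, min, max."""

    def __init__(self, num_cols):
        self.n = 0                          # rows seen so far
        self.mean = [0.0] * num_cols        # running mean per column
        self.m2 = [0.0] * num_cols          # sum of squared deviations per column
        self.mins = [math.inf] * num_cols
        self.maxs = [-math.inf] * num_cols

    def add(self, row):
        """Fold one row (a sequence of floats) into the summary."""
        self.n += 1
        for j, x in enumerate(row):
            delta = x - self.mean[j]
            self.mean[j] += delta / self.n
            self.m2[j] += delta * (x - self.mean[j])
            self.mins[j] = min(self.mins[j], x)
            self.maxs[j] = max(self.maxs[j], x)
        return self

    def merge(self, other):
        """Combine another partial summary into this one (pairwise update)."""
        total = self.n + other.n
        if total == 0:
            return self
        for j in range(len(self.mean)):
            delta = other.mean[j] - self.mean[j]
            self.m2[j] += other.m2[j] + delta * delta * self.n * other.n / total
            self.mean[j] += delta * other.n / total
            self.mins[j] = min(self.mins[j], other.mins[j])
            self.maxs[j] = max(self.maxs[j], other.maxs[j])
        self.n = total
        return self

    def variance(self):
        """Unbiased sample variance per column; NaN with fewer than two rows."""
        if self.n < 2:
            return [float("nan")] * len(self.m2)
        return [m2 / (self.n - 1) for m2 in self.m2]
```

Having both add and merge is what lets a summary be built per partition and
then reduced into one result, without keeping the rows themselves.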

  was:
Basically, it will be ported from Mahout's OnlineSummarizer:

https://github.com/apache/mahout/blob/master/math/src/main/java/org/apache/mahout/math/stats/OnlineSummarizer.java

Computes on-line estimates of mean, variance and all five quartiles (notably 
including the median).  Since this is done in a completely incremental fashion 
(that is what is meant by on-line) estimates are available at any time and the 
amount of memory used is constant.  

Somewhat surprisingly, the quantile estimates are about as good as you would 
get if you actually kept all of the samples.
 
The method used for mean and variance is Welford's method.  See
 
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
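
For reference, Welford's update for a single stream of values can be sketched
in a few lines of Python. This is a generic illustration of the algorithm, not
the Mahout code; the name RunningStats is made up for this example.

```python
class RunningStats:
    """Welford's online algorithm for mean and sample variance."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # The second factor uses the already-updated mean; this is what
        # keeps the update numerically stable.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Unbiased sample variance; undefined for fewer than two samples.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")
```

Only three numbers are stored no matter how many samples are added, which is
the constant-memory property mentioned above.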

The method used for computing the quartiles is a simplified form of the 
stochastic approximation method described in the article "Incremental Quantile 
Estimation for Massive Tracking" by Chen, Lambert and Pinheiro.
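
To give a feel for stochastic-approximation quantile tracking, here is a
minimal Python sketch of a plain Robbins-Monro style update. It is neither the
Chen, Lambert and Pinheiro method nor Mahout's simplified form of it; the names
QuantileTracker, estimate_quantile, and demo_median are hypothetical, and the
scale parameter (which should roughly match the spread of the data) is an
assumption of this sketch.

```python
import random


class QuantileTracker:
    """Tracks one quantile p of a stream using constant memory."""

    def __init__(self, p, scale=1.0):
        self.p = p          # target quantile, e.g. 0.5 for the median
        self.scale = scale  # step-size scale; an assumed tuning knob
        self.n = 0          # samples seen so far
        self.q = None       # current estimate

    def add(self, x):
        self.n += 1
        if self.q is None:
            self.q = x      # start from the first sample
            return
        step = self.scale / self.n
        below = 1.0 if x <= self.q else 0.0
        # Nudge the estimate up when too few samples fall at or below it,
        # and down when too many do.
        self.q += step * (self.p - below)


def estimate_quantile(samples, p, scale=1.0):
    tracker = QuantileTracker(p, scale)
    for x in samples:
        tracker.add(x)
    return tracker.q


def demo_median(count=20000, seed=0):
    # Median of uniform(0, 1) samples; the estimate should land near 0.5.
    rng = random.Random(seed)
    return estimate_quantile((rng.random() for _ in range(count)), 0.5)
```

The estimate is a single number that is adjusted on each sample, so no samples
are retained.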


     Issue Type: Improvement  (was: New Feature)
        Summary: Public available online summarizer for mean, variance, min, 
and max  (was: Online Summarizer for mean, variance, min, max, and quartile)

> Public available online summarizer for mean, variance, min, and max
> -------------------------------------------------------------------
>
>                 Key: SPARK-1969
>                 URL: https://issues.apache.org/jira/browse/SPARK-1969
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>
> Basically, it moves the private ColumnStatisticsAggregator class out of
> RowMatrix and exposes it as a publicly available DeveloperApi.
> Changes:
> 1) Moved the trait from 
> org.apache.spark.mllib.stat.MultivariateStatisticalSummary to 
> org.apache.spark.mllib.stats.Summarizer 
> 2) Moved the private implementation from
> org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to
> org.apache.spark.mllib.stats.OnlineSummarizer
> 3) Added API documentation for OnlineSummarizer
> 4) Added unit tests for OnlineSummarizer



--
This message was sent by Atlassian JIRA
(v6.2#6252)
