[ https://issues.apache.org/jira/browse/SPARK-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
DB Tsai updated SPARK-1969:
---------------------------
Description:
Basically, it moves the private ColumnStatisticsAggregator class from RowMatrix
to a publicly available DeveloperApi.
Changes:
1) Moved the trait from
org.apache.spark.mllib.stat.MultivariateStatisticalSummary to
org.apache.spark.mllib.stats.Summarizer
2) Moved the private implementation from
org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to
org.apache.spark.mllib.stats.OnlineSummarizer
3) Added the API documentation for OnlineSummarizer
4) Added the unit test for OnlineSummarizer
was:
Basically, it will be ported from Mahout's OnlineSummarizer
https://github.com/apache/mahout/blob/master/math/src/main/java/org/apache/mahout/math/stats/OnlineSummarizer.java
Computes on-line estimates of mean, variance and all five quartiles (notably
including the median). Since this is done in a completely incremental fashion
(that is what is meant by on-line) estimates are available at any time and the
amount of memory used is constant.
Somewhat surprisingly, the quantile estimates are about as good as you would
get if you actually kept all of the samples.
The method used for mean and variance is Welford's method. See
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
The method used for computing the quartiles is a simplified form of the
stochastic approximation method described in the article "Incremental Quantile
Estimation for Massive Tracking" by Chen, Lambert and Pinheiro.
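The Welford update mentioned above can be sketched for a single stream of
scalars as follows. This is only an illustration in Python, not the Mahout or
MLlib implementation: the class name RunningStats and its fields are
hypothetical, quantile estimation is left out, and min/max tracking is included
only because the revised summary lists them.

```python
import math


class RunningStats:
    """Hypothetical helper: one-pass mean and variance (Welford), plus min and max."""

    def __init__(self):
        self.n = 0              # number of samples seen so far
        self.mean = 0.0         # running mean
        self.m2 = 0.0           # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        """Fold one sample into the running statistics using constant memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # The second factor uses the already-updated mean; this is Welford's update.
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        return self

    def variance(self):
        """Unbiased sample variance; NaN when fewer than two samples were seen."""
        if self.n < 2:
            return math.nan
        return self.m2 / (self.n - 1)


# Usage: estimates are available after every call to add().
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(value)
```

Only the count, the mean, the sum of squared deviations, and the two extremes
are stored, so memory use does not grow with the number of samples.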
Issue Type: Improvement (was: New Feature)
Summary: Public available online summarizer for mean, variance, min,
and max (was: Online Summarizer for mean, variance, min, max, and quartile)
> Public available online summarizer for mean, variance, min, and max
> -------------------------------------------------------------------
>
> Key: SPARK-1969
> URL: https://issues.apache.org/jira/browse/SPARK-1969
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: DB Tsai
>
> Basically, it moves the private ColumnStatisticsAggregator class from
> RowMatrix to a publicly available DeveloperApi.
> Changes:
> 1) Moved the trait from
> org.apache.spark.mllib.stat.MultivariateStatisticalSummary to
> org.apache.spark.mllib.stats.Summarizer
> 2) Moved the private implementation from
> org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to
> org.apache.spark.mllib.stats.OnlineSummarizer
> 3) Added the API documentation for OnlineSummarizer
> 4) Added the unit test for OnlineSummarizer