[
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003190#comment-15003190
]
Xiangrui Meng commented on SPARK-10384:
---------------------------------------
I marked this JIRA as resolved. Approximate median/quantiles and mode will be
addressed as follow-up work.
> Univariate statistics as UDAFs
> ------------------------------
>
> Key: SPARK-10384
> URL: https://issues.apache.org/jira/browse/SPARK-10384
> Project: Spark
> Issue Type: Umbrella
> Components: ML, SQL
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> It would be nice to define univariate statistics as UDAFs. This JIRA
> discusses the general implementation and tracks the progress of subtasks.
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness,
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with
> DataFrames, e.g.,
> {code}
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance might
> depend on mean and count. It would be nice if SQL could optimize the sequence
> of aggregates to avoid duplicate computation.
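
The dependency noted above (variance needing mean and count) can be sketched with a single mergeable state from which variance, skewness, and kurtosis are all derived, so no input value is read twice. This is a minimal illustration in plain Python, not Spark's implementation; the names `Moments`, `merge`, `add`, and `aggregate` are hypothetical, and the merge step uses the standard pairwise update formulas for central moment sums.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Moments:
    """Hypothetical partial state: count, mean, and 2nd-4th central moment sums."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of (x - mean)^2
    m3: float = 0.0  # sum of (x - mean)^3
    m4: float = 0.0  # sum of (x - mean)^4

    def merge(self, other: "Moments") -> "Moments":
        """Combine two partial states, e.g. from two partitions."""
        if other.n == 0:
            return self
        if self.n == 0:
            return other
        na, nb = self.n, other.n
        n = na + nb
        d = other.mean - self.mean
        mean = self.mean + d * nb / n
        m2 = self.m2 + other.m2 + d * d * na * nb / n
        m3 = (self.m3 + other.m3
              + d ** 3 * na * nb * (na - nb) / n ** 2
              + 3 * d * (na * other.m2 - nb * self.m2) / n)
        m4 = (self.m4 + other.m4
              + d ** 4 * na * nb * (na * na - na * nb + nb * nb) / n ** 3
              + 6 * d * d * (na * na * other.m2 + nb * nb * self.m2) / n ** 2
              + 4 * d * (na * other.m3 - nb * self.m3) / n)
        return Moments(n, mean, m2, m3, m4)

    def add(self, x: float) -> "Moments":
        """Fold in one value by merging with a single-element state."""
        return self.merge(Moments(1, float(x)))

    # Every statistic below reuses the same shared state.
    def var_pop(self) -> float:
        return self.m2 / self.n

    def var_samp(self) -> float:
        return self.m2 / (self.n - 1)

    def skewness(self) -> float:
        return math.sqrt(self.n) * self.m3 / self.m2 ** 1.5

    def kurtosis(self) -> float:
        # Excess kurtosis (0 for a normal distribution).
        return self.n * self.m4 / self.m2 ** 2 - 3.0


def aggregate(values) -> Moments:
    """Build the state for one partition in a single pass."""
    state = Moments()
    for x in values:
        state = state.add(x)
    return state
```

Two partitions can be aggregated independently and combined with `aggregate(part_a).merge(aggregate(part_b))`; the statistics for the whole column are then read off the merged state. The derived statistics assume enough data (`n > 1` for the sample variance, non-zero `m2` for skewness and kurtosis).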
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * -range- (SPARK-10861) - won't add
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761) -> 1.7.0
> * approximate quantiles (SPARK-6761) -> 1.7.0
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) -> 1.7.0
> * -number of categories- (This is COUNT DISTINCT in SQL.)
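
For the categorical statistics listed above, a rough sketch of how an exact mode and the number of categories could share one mergeable frequency table follows. It is plain Python rather than Spark code; the function names are hypothetical, and breaking ties by the smallest value is an assumption made only to keep the result deterministic.

```python
from collections import Counter


def partial_counts(values) -> Counter:
    """Per-partition state: a frequency table of the observed categories."""
    return Counter(values)


def merge_counts(a: Counter, b: Counter) -> Counter:
    """Combine two partial frequency tables by adding their counts."""
    return a + b


def mode(counts: Counter):
    """Most frequent category; ties go to the smallest value (an assumption)."""
    best = max(counts.values())
    return min(value for value, count in counts.items() if count == best)


def num_categories(counts: Counter) -> int:
    """Number of distinct categories, i.e. COUNT DISTINCT."""
    return len(counts)
```

Because the exact mode needs one counter per distinct value, the state grows with the number of categories.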