[ 
https://issues.apache.org/jira/browse/SPARK-10384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003190#comment-15003190
 ] 

Xiangrui Meng commented on SPARK-10384:
---------------------------------------

I marked this JIRA as resolved. Approximate median/quantiles and mode will be 
addressed as follow-up work.

> Univariate statistics as UDAFs
> ------------------------------
>
>                 Key: SPARK-10384
>                 URL: https://issues.apache.org/jira/browse/SPARK-10384
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, SQL
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>             Fix For: 1.6.0
>
>
> It would be nice to define univariate statistics as UDAFs. This JIRA 
> discusses the general implementation and tracks the progress of subtasks. 
> Univariate statistics include:
> continuous: min, max, range, variance, stddev, median, quantiles, skewness, 
> and kurtosis
> categorical: number of categories, mode
> If we define them as UDAFs, it would be quite flexible to use them with 
> DataFrames, e.g.,
> {code}
> import org.apache.spark.sql.functions._  // min, variance, skewness
> df.groupBy("key").agg(min("x"), min("y"), variance("x"), skewness("x"))
> {code}
> Note that some univariate statistics depend on others, e.g., variance 
> depends on mean and count. It would be nice if SQL could optimize the 
> sequence to avoid duplicate computation.
> Univariate statistics for continuous variables:
> * -min-
> * -max-
> * -range- (SPARK-10861) - won't add
> * -mean-
> * sample variance (SPARK-9296)
> * population variance (SPARK-9296)
> * -sample standard deviation- (SPARK-6458)
> * -population standard deviation- (SPARK-6458)
> * skewness (SPARK-10641)
> * kurtosis (SPARK-10641)
> * approximate median (SPARK-6761) -> 1.7.0
> * approximate quantiles (SPARK-6761) -> 1.7.0
> Univariate statistics for categorical variables:
> * mode: https://en.wikipedia.org/wiki/Mode_(statistics) (SPARK-10936) -> 1.7.0
> * -number of categories- (This is COUNT DISTINCT in SQL.)
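
To illustrate the note above about statistics that depend on one another, here is a minimal sketch in plain Python (not Spark code, and not how Spark implements these aggregates). It assumes a single hypothetical buffer, Moments, that keeps the count and the raw power sums up to order 4, so that mean, variance, skewness, and kurtosis can all be derived from one pass over the data. The names Moments and summarize are made up for this example, and raw power sums are numerically naive for large or badly scaled values.

```python
import math
from dataclasses import dataclass


@dataclass
class Moments:
    """Hypothetical shared buffer: count and raw power sums up to order 4."""
    n: int = 0
    s1: float = 0.0
    s2: float = 0.0
    s3: float = 0.0
    s4: float = 0.0

    def update(self, x):
        # Fold one value into the buffer (one pass over the data).
        self.n += 1
        self.s1 += x
        self.s2 += x * x
        self.s3 += x ** 3
        self.s4 += x ** 4
        return self

    def merge(self, other):
        # Combine two partial buffers, e.g. from different partitions.
        return Moments(self.n + other.n, self.s1 + other.s1,
                       self.s2 + other.s2, self.s3 + other.s3,
                       self.s4 + other.s4)

    def stats(self):
        # Derive every statistic from the same buffer, with no second pass.
        # Assumes n >= 2 and a non-constant column (m2 > 0).
        n = self.n
        mean = self.s1 / n
        e2, e3, e4 = self.s2 / n, self.s3 / n, self.s4 / n
        m2 = e2 - mean ** 2
        m3 = e3 - 3 * mean * e2 + 2 * mean ** 3
        m4 = e4 - 4 * mean * e3 + 6 * mean ** 2 * e2 - 3 * mean ** 4
        return {
            "count": n,
            "mean": mean,
            "var_pop": m2,
            "var_samp": m2 * n / (n - 1),
            "skewness": m3 / m2 ** 1.5,
            "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis
        }


def summarize(values):
    """Aggregate an iterable of numbers in one pass and return its statistics."""
    buf = Moments()
    for v in values:
        buf.update(v)
    return buf.stats()


if __name__ == "__main__":
    print(summarize([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The merge method is what would let partial results from separate partitions be combined before the final statistics are computed.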



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
