[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

Nick Pentreath (JIRA) Thu, 23 Feb 2017 07:20:15 -0800

    [ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880623#comment-15880623 ]


Nick Pentreath commented on SPARK-19634:
----------------------------------------

Thanks [~timhunter].

In terms of performance, we expect to gain from (a) not computing unnecessary 
metrics or values (saving mainly in memory usage for the intermediate arrays 
created, potentially some computation saving); and (b) using UDAF.
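
To make point (a) concrete, here is a minimal sketch in plain Python. It is not 
Spark code and not the design in the doc: the class name SelectiveSummarizer, 
its metric names, and its methods are all hypothetical. It only illustrates how 
an online summarizer could allocate per-feature buffers for just the metrics 
that were requested, using Welford's update for mean and variance.

```python
from typing import Dict, List, Optional, Sequence


class SelectiveSummarizer:
    """Hypothetical online summarizer (illustration only, not Spark's API).

    Only the per-feature buffers needed by the requested metrics are allocated.
    """

    SUPPORTED = {"count", "mean", "variance", "min", "max"}

    def __init__(self, metrics: Sequence[str]) -> None:
        unknown = set(metrics) - self.SUPPORTED
        if unknown:
            raise ValueError(f"unsupported metrics: {sorted(unknown)}")
        self.metrics = set(metrics)
        self.count = 0
        # Buffers stay None unless a requested metric needs them.
        self._mean: Optional[List[float]] = None
        self._m2: Optional[List[float]] = None
        self._min: Optional[List[float]] = None
        self._max: Optional[List[float]] = None

    def add(self, vector: Sequence[float]) -> "SelectiveSummarizer":
        n = len(vector)
        if self.count == 0:
            # Variance depends on the running mean, so it allocates both.
            if self.metrics & {"mean", "variance"}:
                self._mean = [0.0] * n
            if "variance" in self.metrics:
                self._m2 = [0.0] * n
            if "min" in self.metrics:
                self._min = [float("inf")] * n
            if "max" in self.metrics:
                self._max = [float("-inf")] * n
        self.count += 1
        for i, x in enumerate(vector):
            if self._mean is not None:
                # Welford's online update of mean and sum of squared deviations.
                delta = x - self._mean[i]
                self._mean[i] += delta / self.count
                if self._m2 is not None:
                    self._m2[i] += delta * (x - self._mean[i])
            if self._min is not None and x < self._min[i]:
                self._min[i] = x
            if self._max is not None and x > self._max[i]:
                self._max[i] = x
        return self

    def allocated_buffers(self) -> List[str]:
        """Names of the intermediate arrays that were actually created."""
        buffers = {"mean": self._mean, "m2": self._m2,
                   "min": self._min, "max": self._max}
        return [name for name, buf in buffers.items() if buf is not None]

    def result(self) -> Dict[str, object]:
        if self.count == 0:
            raise ValueError("no data added")
        out: Dict[str, object] = {}
        if "count" in self.metrics:
            out["count"] = self.count
        if "mean" in self.metrics:
            out["mean"] = list(self._mean)
        if "variance" in self.metrics:
            # Sample variance; defined as 0.0 when only one row was seen.
            denom = self.count - 1
            out["variance"] = [m2 / denom if denom > 0 else 0.0
                               for m2 in self._m2]
        if "min" in self.metrics:
            out["min"] = list(self._min)
        if "max" in self.metrics:
            out["max"] = list(self._max)
        return out
```

In this sketch, requesting only "mean" creates a single intermediate array per 
summarizer, whereas requesting "variance", "min" and "max" creates four. 
Whether the saving matters in practice would still need to be measured.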

Do we expect a large gain from using UDAF? I'm not totally up to date on the 
current state of UDAF integration with Tungsten data, but my last impression 
was that (a) UDAFs didn't really benefit from Tungsten unless they're internal 
(like HyperLogLog) and (b) array storage & SerDe in Tungsten was still a bit 
patchy. Has this changed?

Of course, in terms of API it is beneficial, and we should do it anyway under 
the assumption that performance is at least the same as the current 
implementation. I just want to understand the expected performance gains, since 
the implicit assumption is always "DataFrame operations will be so much 
faster". In practice this is not always the case for more complex data types 
and situations, or for things that switch into RDDs under the hood anyway, such 
as the linear models.

> Feature parity for descriptive statistics in MLlib
> --------------------------------------------------
>
>                 Key: SPARK-19634
>                 URL: https://issues.apache.org/jira/browse/SPARK-19634
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208. Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
