[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880623#comment-15880623 ]

Nick Pentreath commented on SPARK-19634:
----------------------------------------

Thanks [~timhunter].

In terms of performance, we expect to gain from (a) not computing unnecessary metrics or values (saving mainly in memory usage for the intermediate arrays created, potentially some computation saving); and (b) using UDAF. Do we expect a large gain from using UDAF? I'm not totally up to date on the current state of UDAF integration into working with Tungsten data, but my last impression was that (a) UDAFs didn't really offer this unless they're internal (like HyperLogLog) and (b) array storage & SerDe in Tungsten was still a bit patchy. Has this changed?

Of course in terms of API it is beneficial and we should do it anyway under the assumption that performance is at least the same as the current implementation. I just want to understand the expected performance gains since the implicit assumption is always "DataFrame operations will be so much faster" but in practice this is not always the case for more complex data types & situations, and things that switch into RDDs anyway under the hood such as in the linear models cases...

> Feature parity for descriptive statistics in MLlib
> --------------------------------------------------
>
>                 Key: SPARK-19634
>                 URL: https://issues.apache.org/jira/browse/SPARK-19634
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org