[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944215#comment-15944215
 ] 

Timothy Hunter commented on SPARK-19634:
----------------------------------------

[~sethah], yes, thanks for bringing up these concerns. Regarding the first 
points, the UDAF interface does not let you update arrays in place, which is a 
non-starter in our case. This is why the implementation switches to TIA 
(TypedImperativeAggregate). I have updated the design doc with these comments.
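
To illustrate why in-place updates matter here, below is a minimal sketch in 
plain Python. It is not Spark code and does not use the UDAF or TIA 
interfaces; the function names and the sample rows are hypothetical. It only 
contrasts an update step that must return a fresh buffer for every row with 
one that mutates a single buffer.

```python
# Illustrative only: plain Python, not Spark. All names are hypothetical.

def update_copy(buffer, row):
    """Return a NEW buffer with the row added element-wise.

    Models an aggregation interface where the buffer cannot be mutated,
    so every input row allocates a fresh array.
    """
    return [b + x for b, x in zip(buffer, row)]


def update_in_place(buffer, row):
    """Add the row element-wise into the existing buffer (no new array)."""
    for i, x in enumerate(row):
        buffer[i] += x
    return buffer


def column_sums(rows, width, in_place):
    """Sum each column of `rows`; also count how many buffers were created."""
    buffer = [0.0] * width
    allocations = 1  # the initial buffer
    for row in rows:
        if in_place:
            update_in_place(buffer, row)
        else:
            buffer = update_copy(buffer, row)
            allocations += 1  # one extra array per input row
    return buffer, allocations


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
sums_copy, allocs_copy = column_sums(rows, 3, in_place=False)
sums_mut, allocs_mut = column_sums(rows, 3, in_place=True)
```

Both variants produce the same sums; the copying variant creates one extra 
array per input row, while the in-place variant reuses a single buffer. With 
wide feature vectors and many rows, that per-row allocation is the cost the 
comment above refers to.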

Regarding the performance, I agree that there is a tension between having an 
API that is compatible with structured streaming and the current, RDD-based 
implementation. I will provide some test numbers so that we have a basis for 
discussion. That being said, the RDD API is not going away, so if users care 
about performance and do not need the additional benefit of integrating with 
SQL or structured streaming, they can still use it.

> Feature parity for descriptive statistics in MLlib
> --------------------------------------------------
>
>                 Key: SPARK-19634
>                 URL: https://issues.apache.org/jira/browse/SPARK-19634
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Timothy Hunter
>            Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208. Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



