[
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944215#comment-15944215
]
Timothy Hunter commented on SPARK-19634:
----------------------------------------
[~sethah], yes, thanks for bringing up these concerns. Regarding the first
points, the UDAF interface does not let you update arrays in place, which is a
non-starter in our case. This is why the implementation switches to
TypedImperativeAggregate (TIA). I have updated the design doc with these
comments.
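To illustrate the in-place concern, here is a minimal sketch in plain Python.
It is not Spark code, and the function names are hypothetical. A summarizer
keeps a running array per statistic; if the buffer cannot be mutated, each
input row roughly costs one new array allocation, whereas a mutable buffer is
reused across rows.

```python
from typing import List


def update_copying(buffer: List[float], row: List[float]) -> List[float]:
    # Immutable style: every row produces a freshly allocated buffer.
    return [b + x for b, x in zip(buffer, row)]


def update_in_place(buffer: List[float], row: List[float]) -> None:
    # Mutable style: the same buffer object is reused for every row.
    for i, x in enumerate(row):
        buffer[i] += x


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Running per-column sums, rebuilt on every row.
sums_copy = [0.0, 0.0]
for r in rows:
    sums_copy = update_copying(sums_copy, r)

# Running per-column sums, updated in the original buffer.
sums_mut = [0.0, 0.0]
for r in rows:
    update_in_place(sums_mut, r)
```

Both variants compute the same column sums; the difference is only in how
many intermediate arrays are created along the way.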
Regarding performance, I agree that there is a tension between having an
API that is compatible with structured streaming and the current, RDD-based
implementation. I will provide some test numbers so that we have a basis for
discussion. That being said, the RDD API is not going away, so if users care
about performance and do not need the additional benefit of integrating with
SQL or structured streaming, they can still use it.
> Feature parity for descriptive statistics in MLlib
> --------------------------------------------------
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Timothy Hunter
> Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208. Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#