[jira] [Commented] (SPARK-10232) Decide whether spark.ml Decision Tree and Random Forest can replace spark.mllib implementation

Joseph K. Bradley (JIRA) Tue, 25 Aug 2015 11:54:24 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711779#comment-14711779
 ]


Joseph K. Bradley commented on SPARK-10232:
-------------------------------------------

Regression test results are in the attached PNGs.  Summary: I do not see any 
significant regressions.

Tests were run using spark-perf:
* 5 trials per test, dropping the first trial and averaging over the other 4
* 16 EC2 workers (r3.2xlarge)
* Note: I separately looked at test error on held-out data, and there were no 
regressions there either.
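
The timing summary described above (drop the first trial, average the rest) can be sketched as follows.  This is only an illustration of the arithmetic, not spark-perf's own code; the function name and the sample timings are hypothetical.

```python
from statistics import mean
from typing import Sequence


def average_after_warmup(trial_seconds: Sequence[float], warmup: int = 1) -> float:
    """Average trial times after discarding the first `warmup` trials.

    The first trial is treated as a warm-up run and excluded from the mean.
    """
    if len(trial_seconds) <= warmup:
        raise ValueError("need more trials than warm-up runs")
    return mean(trial_seconds[warmup:])


# Made-up timings for illustration only; these are NOT the measured results.
example_trials = [12.0, 10.0, 10.5, 9.5, 10.0]
print(average_after_warmup(example_trials))
```

Dropping the first trial keeps one-time startup costs out of the reported average.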

Details on results:
* spark.ml tests ran slightly slower in general, likely because of 
DataFrame-RDD conversions.  Since the underlying implementation still uses 
RDDs, we should make sure to avoid converting to DataFrames when calling the 
spark.ml implementation from spark.mllib.  (We can provide a private RDD-based 
API for training.)
* spark.ml RandomForests were slightly faster for deeper (depth 10) trees, 
likely because of some slight efficiency improvements I made by removing "bins."
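
A rough sketch of the delegation idea from the first bullet, assuming a shared private trainer that both APIs call.  Everything here is a hypothetical stand-in written in plain Python: a list of dicts plays the role of a DataFrame, a list of (label, features) pairs plays the role of an RDD, and the "model" is just the mean label.  None of these names are Spark APIs.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for an RDD record: (label, features).
LabeledPoint = Tuple[float, Sequence[float]]


def _train_core(points: List[LabeledPoint]) -> float:
    """Private trainer that works directly on the RDD-like input.

    Toy model for illustration: it just returns the mean label.
    """
    return sum(label for label, _ in points) / len(points)


def train_new_api(rows: List[dict]) -> float:
    """New-style entry point: converts table rows once, then delegates."""
    points = [(row["label"], row["features"]) for row in rows]
    return _train_core(points)


def train_old_api(points: List[LabeledPoint]) -> float:
    """Old-style entry point: delegates directly, with no table round trip."""
    return _train_core(points)
```

The point of this layout is that the old-style entry point never pays for a conversion to the table representation and back.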

Conclusion: It is feasible to replace the implementation.  However, we should 
treat it as low priority, since it will be a fair amount of work and there is 
no immediate benefit.

> Decide whether spark.ml Decision Tree and Random Forest can replace 
> spark.mllib implementation
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10232
>                 URL: https://issues.apache.org/jira/browse/SPARK-10232
>             Project: Spark
>          Issue Type: Task
>          Components: ML, MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>         Attachments: GBT.png, RandomForest.png
>
>
> This JIRA is for discussing replacing the spark.mllib DecisionTree and 
> RandomForest implementations with the implementation in spark.ml.  The new 
> implementation is simply a copy, with slight modifications (removing "bins").
> Pros:
> * Support only 1 implementation.
> * Efficiency gains in spark.ml will benefit both APIs.
> Cons:
> * As spark.ml tree functionality increases, we will need to maintain 
> conversion code for converting spark.ml trees to spark.mllib trees.
> Must:
> * Ensure we do not have significant regressions in the new implementation.



