[ https://issues.apache.org/jira/browse/SPARK-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711779#comment-14711779 ]
Joseph K. Bradley commented on SPARK-10232:
-------------------------------------------
Regression test results are in the attached PNGs. Summary: I do not see any
significant regressions.
Tests were run using spark-perf:
* 5 trials per test, dropping the first trial and averaging over the remaining 4
* 16 EC2 workers (r3.2xlarge)
* Note: I separately looked at test error on held-out data, and there were no
regressions there either.
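The trial-averaging rule above can be sketched as follows. This is only an illustration, not spark-perf code: the function name `steady_state_mean` and the timings are hypothetical, chosen to show why the first (warm-up) trial is dropped.

```python
from statistics import mean


def steady_state_mean(trial_times, warmup=1):
    """Average trial times after discarding the first `warmup` trials."""
    if len(trial_times) <= warmup:
        raise ValueError("need more trials than warm-up runs")
    return mean(trial_times[warmup:])


# Hypothetical timings in seconds for 5 trials; the first one is inflated
# by one-time costs such as JIT compilation and cache warm-up.
trials = [14.0, 10.0, 11.0, 9.0, 10.0]
print(steady_state_mean(trials))  # averages only the last 4 trials
```

Including the first trial would pull the average toward one-time startup costs rather than the steady-state training time being compared.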
Details on results:
* spark.ml tests ran slightly slower in general, likely because of
DataFrame-RDD conversions. Since the underlying implementation still uses
RDDs, we should make sure to avoid converting to DataFrames when calling the
spark.ml implementation from spark.mllib. (We can provide a private RDD-based
API for training.)
* spark.ml RandomForests were slightly faster for deeper (depth 10) trees,
likely because of some slight efficiency improvements I made by removing "bins."
Conclusion: It is feasible to replace the implementation. However, there is no
need to rush: it will be a fair amount of work and there is no immediate
benefit.
> Decide whether spark.ml Decision Tree and Random Forest can replace
> spark.mllib implementation
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-10232
> URL: https://issues.apache.org/jira/browse/SPARK-10232
> Project: Spark
> Issue Type: Task
> Components: ML, MLlib
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Attachments: GBT.png, RandomForest.png
>
>
> This JIRA is for discussing replacing the spark.mllib DecisionTree and
> RandomForest implementations with the implementation in spark.ml. The new
> implementation is simply a copy, with slight modifications (removing "bins").
> Pros:
> * Support only 1 implementation.
> * Efficiency gains in spark.ml will benefit both APIs.
> Cons:
> * As spark.ml tree functionality increases, we will need to maintain
> conversion code for converting spark.ml trees to spark.mllib trees.
> Must:
> * Ensure we do not have significant regressions in the new implementation.