[ 
https://issues.apache.org/jira/browse/SPARK-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526943#comment-14526943
 ] 

Joseph K. Bradley commented on SPARK-7127:
------------------------------------------

[~bryanc]

1.  Broadcasting ensures the object is sent to each worker only once.  I'd use 
mapPartitions as was done here 
[https://github.com/apache/spark/blob/5a1a1075a607be683f008ef92fa227803370c45f/mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala]
 so that you only call bcastModel.value once per partition.  The only 
difference is that you should not unpersist the broadcast variable, since the 
returned DataFrame is evaluated lazily and will still need it later.
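
For illustration, here is a minimal sketch of that pattern using the RDD API. 
It is not the code from the linked file: ToyEnsemble, predictAll and 
BroadcastPredictSketch are hypothetical names, and the "model" is just a 
weighted sum standing in for a real tree ensemble.  It only assumes Spark is 
on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for a tree ensemble model; not a Spark class.
case class ToyEnsemble(weights: Array[Double]) {
  def predict(features: Array[Double]): Double =
    features.zip(weights).map { case (x, w) => x * w }.sum
}

object BroadcastPredictSketch {
  def predictAll(
      sc: SparkContext,
      model: ToyEnsemble,
      data: RDD[Array[Double]]): RDD[Double] = {
    // Ship the model to each executor once instead of once per task closure.
    val bcastModel: Broadcast[ToyEnsemble] = sc.broadcast(model)
    data.mapPartitions { rows =>
      // Dereference the broadcast once per partition, not once per row.
      val localModel = bcastModel.value
      rows.map(localModel.predict)
    }
    // Deliberately no bcastModel.unpersist() here: the returned RDD is lazy
    // and still needs the broadcast value when it is eventually computed.
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("bcast-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)), 2)
      val preds = predictAll(sc, ToyEnsemble(Array(0.5, 0.25)), data).collect()
      println(preds.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The same reasoning about unpersist applies to a DataFrame returned from 
transform(), which is also evaluated lazily.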

2. Some of the boilerplate in Predictor.transform will go away when this PR 
gets merged: [https://github.com/apache/spark/pull/5820].  As for the rest, I 
don't think we need another trait, but perhaps this would help reduce duplicate 
code:
* Add a protected Predictor.transformImpl method which is almost the same as 
transform() but takes an additional argument: a Boolean indicating whether or 
not to broadcast the model.
* Change Predictor.transform to call transformImpl with the Boolean set to 
false.
* Have the ensemble models override transform() and call transformImpl with the 
Boolean set appropriately.
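
The three steps above could be sketched roughly as follows.  This is plain 
Scala with toy types, not the real Predictor API: a "dataset" is a Seq of 
feature vectors, and all class names, the broadcastModel parameter name and 
the lastCallBroadcast flag are hypothetical.  The flag only records what was 
requested; no actual broadcast happens here.

```scala
// Toy base class standing in for the shared prediction logic.
abstract class ToyPredictionModel {
  def predict(features: Seq[Double]): Double

  // Records whether the last transformImpl call asked for a broadcast.
  // Present only so the sketch has something observable.
  var lastCallBroadcast: Boolean = false

  // Shared implementation: almost the same as transform(), plus a Boolean
  // saying whether the model should be broadcast before predicting.
  protected def transformImpl(
      data: Seq[Seq[Double]],
      broadcastModel: Boolean): Seq[Double] = {
    lastCallBroadcast = broadcastModel
    data.map(predict)
  }

  // Default behaviour: do not broadcast.
  def transform(data: Seq[Seq[Double]]): Seq[Double] =
    transformImpl(data, broadcastModel = false)
}

// A small model that keeps the default transform().
class ToySingleModel extends ToyPredictionModel {
  def predict(features: Seq[Double]): Double = features.sum
}

// An ensemble overrides transform() only to flip the flag.
class ToyEnsembleModel(members: Seq[ToyPredictionModel])
    extends ToyPredictionModel {
  def predict(features: Seq[Double]): Double =
    members.map(_.predict(features)).sum / members.size

  override def transform(data: Seq[Seq[Double]]): Seq[Double] =
    transformImpl(data, broadcastModel = true)
}
```

With this shape, each ensemble model's override is a single line, and the 
shared logic stays in one place without introducing another trait.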

I'm going to remove the "starter" tag since this is more complex than I 
expected, but feel free to take a shot at it still!

> Broadcast spark.ml tree ensemble models for predict
> ---------------------------------------------------
>
>                 Key: SPARK-7127
>                 URL: https://issues.apache.org/jira/browse/SPARK-7127
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> GBTRegressor/Classifier and RandomForestRegressor/Classifier should broadcast 
> models and then predict.  This will mean overriding transform().
> Note: Try to reduce duplicated code via the TreeEnsembleModel abstraction.



