[ https://issues.apache.org/jira/browse/SPARK-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541433#comment-14541433 ]
Joseph K. Bradley commented on SPARK-7127:
------------------------------------------
mapPartitions is an RDD method, so it returns an RDD rather than a DataFrame.
By using it, you end up with 2 separate datasets (the original DataFrame and
the RDD of predictions) which must then be joined. You're trying to do that
with "withColumn," but "withColumn" can only add a column computed from the
same DataFrame, so you would have to use join instead.
However, a better approach is to stick with DataFrame-only methods, which
return DataFrames rather than RDDs. To do that, you can broadcast the model
and then use it in a UDF. (Search the spark.ml code for "callUDF" method
invocations for examples.) That UDF can be used with "withColumn" to add the
prediction column.
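
As a rough sketch of the broadcast-plus-UDF idea (not the actual spark.ml
code), something like the following should work. ToyModel, BroadcastPredict,
addPredictions and the "features" column name are made up for illustration,
and the sketch uses functions.udf rather than callUDF. It assumes a Spark 1.x
build, where feature vectors are org.apache.spark.mllib.linalg.Vector.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for a fitted model. Case classes are serializable,
// which is all that broadcasting requires.
case class ToyModel(weights: Array[Double]) {
  // Dot product of the features with the weights.
  def predict(features: Vector): Double =
    features.toArray.zip(weights).map { case (x, w) => x * w }.sum
}

object BroadcastPredict {
  // Returns df with an extra "prediction" column. Assumes df has a
  // "features" column holding vectors (an assumption for this sketch).
  def addPredictions(sc: SparkContext, df: DataFrame, model: ToyModel): DataFrame = {
    // Ship the model to each executor once instead of with every task.
    val bcModel = sc.broadcast(model)
    // The UDF closes over the broadcast handle, not the model itself.
    val predictUDF = udf { (features: Vector) => bcModel.value.predict(features) }
    // DataFrame in, DataFrame out: no RDD and no join needed.
    df.withColumn("prediction", predictUDF(col("features")))
  }
}
```

The point of the sketch is that every step stays within the DataFrame API,
so the prediction column is attached to the same rows it was computed from.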
> Broadcast spark.ml tree ensemble models for predict
> ---------------------------------------------------
>
> Key: SPARK-7127
> URL: https://issues.apache.org/jira/browse/SPARK-7127
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> GBTRegressor/Classifier and RandomForestRegressor/Classifier should broadcast
> models and then predict. This will mean overriding transform().
> Note: Try to reduce duplicated code via the TreeEnsembleModel abstraction.