[
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094182#comment-16094182
]
Saurabh Agrawal edited comment on SPARK-21476 at 7/20/17 5:16 AM:
------------------------------------------------------------------
I'm saying that the trees in the model get serialized with each task, which
increases task deserialization time when the forest is big.
I see that RandomForestClassificationModel has a transformImpl that broadcasts
the model first and then calls predict on the broadcast value
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala#L207-L213).
But transformImpl is never invoked by the transform method in
ProbabilisticClassificationModel. Instead, ProbabilisticClassificationModel uses
the concrete definition of predictRaw.
transform is a distributed operation, but the trees contained within the model
are not broadcast and are instead serialized with each task. Is this the
intended behavior?
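
To make the cost difference concrete, here is a rough stand-alone simulation
in plain Python. It is not Spark code: the forest layout, the task payloads,
the executor count, and every function name are assumptions made up for
illustration. It only compares how many bytes get serialized when the model
travels inside every task versus when it is shipped once per executor and
each task carries a small identifier.

```python
import pickle

# Hypothetical stand-in for a large forest: each "tree" is a list of floats.
def make_forest(num_trees=50, nodes_per_tree=1000):
    return [[float(i * nodes_per_tree + j) for j in range(nodes_per_tree)]
            for i in range(num_trees)]

def task_with_captured_model(forest, partition_id):
    """Serialized payload when the whole model is carried inside the task."""
    return pickle.dumps({"partition": partition_id, "model": forest})

def task_with_broadcast_id(broadcast_id, partition_id):
    """Serialized payload when the task only references a broadcast value."""
    return pickle.dumps({"partition": partition_id, "broadcast_id": broadcast_id})

forest = make_forest()
num_tasks = 20       # assumed number of tasks in the job
num_executors = 4    # assumed number of executors receiving the broadcast

# Model serialized once per task.
captured_total = sum(len(task_with_captured_model(forest, p))
                     for p in range(num_tasks))

# Model serialized once per executor, plus a tiny payload per task.
broadcast_total = (num_executors * len(pickle.dumps(forest))
                   + sum(len(task_with_broadcast_id(0, p))
                         for p in range(num_tasks)))

print("bytes with model captured in each task:", captured_total)
print("bytes with model broadcast once per executor:", broadcast_total)
```

In this toy setup the captured-model total grows with the number of tasks,
while the broadcast total grows with the number of executors, which is the
pattern described above.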
> RandomForest classification model not using broadcast in transform
> ------------------------------------------------------------------
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction
> with pipelines that use RandomForestClassificationModel. While digging into
> the source, I found that the transform method in
> RandomForestClassificationModel binds to its parent
> ProbabilisticClassificationModel, and the only concrete definition that
> RandomForestClassificationModel provides and that transform actually uses is
> predictRaw. predictRaw does not use broadcasting.
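
The dispatch problem in the description can be sketched with a much
simplified class pair. This is a toy model in Python, not the Spark source:
the class names mirror the ones in the report, but the method bodies, the
`calls` list, and the doubling "prediction" are placeholders assumed for
illustration. It shows how a subclass method that the parent's transform
never calls is effectively dead code.

```python
class ProbabilisticClassificationModel:
    """Simplified stand-in for the parent class named in the report."""

    def transform(self, rows):
        # Calls predict_raw per row; transform_impl is never consulted.
        return [self.predict_raw(r) for r in rows]

    def predict_raw(self, row):
        raise NotImplementedError


class RandomForestClassificationModel(ProbabilisticClassificationModel):
    def __init__(self):
        self.calls = []  # records which methods actually run

    def transform_impl(self, rows):
        # Stand-in for the broadcasting path; recorded if it ever runs.
        self.calls.append("transform_impl")
        return [self.predict_raw(r) for r in rows]

    def predict_raw(self, row):
        self.calls.append("predict_raw")
        return row * 2  # placeholder for real tree evaluation


model = RandomForestClassificationModel()
result = model.transform([1, 2, 3])
print(result, model.calls)
```

Only predict_raw shows up in the recorded calls, so whatever transform_impl
does (broadcasting, in the reported case) never takes effect through
transform.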