[ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154849#comment-16154849
 ] 

Peng Meng commented on SPARK-21476:
-----------------------------------

I think the performance difference between my test and yours comes from the number 
of records in each partition. 
In my test, I call:
model.transform(data)
with many instances (e.g. 10k) in each partition, so the model deserialization 
time is a small fraction of the end-to-end time.
In your case there are probably very few instances in each partition, so the 
deserialization time looks very long.
In my case, when I use broadcast, the end-to-end time is longer than with the 
current solution.
We can share the test configurations and do more tests. 
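The amortization argument can be sketched with hypothetical numbers (all timings below are made-up placeholders, not measurements from either test): the model is deserialized once per task, so its share of the task time shrinks as the number of records per partition grows.

```python
# Illustrative only: a one-time per-task model deserialization cost amortized
# over the records processed in that partition. The 2 s / 1 ms figures are
# invented placeholders, not real Spark measurements.

def deser_share(deser_s, per_record_s, records_per_partition):
    """Fraction of a task's time spent deserializing the model."""
    total = deser_s + per_record_s * records_per_partition
    return deser_s / total

# Assume 2 s to deserialize the model and 1 ms to score one record.
many = deser_share(2.0, 0.001, 10_000)  # ~10k records per partition
few = deser_share(2.0, 0.001, 10)       # very few records per partition

print(f"10k records/partition: {many:.1%} of task time is deserialization")
print(f"10 records/partition:  {few:.1%} of task time is deserialization")
```

With many records per partition the deserialization cost is a minor share of the task; with only a few records it dominates, which matches the two observations above.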

> RandomForest classification model not using broadcast in transform
> ------------------------------------------------------------------
>
>                 Key: SPARK-21476
>                 URL: https://issues.apache.org/jira/browse/SPARK-21476
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Saurabh Agrawal
>            Priority: Minor
>
> I noticed significant task deserialization latency while running prediction 
> with pipelines that use RandomForestClassificationModel. While digging into the 
> source, I found that the transform method in RandomForestClassificationModel 
> binds to its parent ProbabilisticClassificationModel, and the only concrete 
> definition that RandomForestClassificationModel provides which is actually 
> used in transform is predictRaw. Broadcasting is not being used in predictRaw.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
