[jira] [Commented] (SPARK-21476) RandomForest classification model not using broadcast in transform

Saurabh Agrawal (JIRA) Wed, 26 Jul 2017 02:32:34 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101443#comment-16101443
 ]


Saurabh Agrawal commented on SPARK-21476:
-----------------------------------------

[[email protected]] My streaming application shows increased latency in the
stage where I use this model for prediction. In the Spark UI for this stage,
the task execution time looks fine, but each task spends a large share of its
time in deserialization. When I comment out the line that calls
model.transform and substitute dummy values, then run the same application in
the same environment, the task deserialization time drops and the stage as a
whole runs significantly faster.

I also tried an XGBoost model through a third-party library that is compatible
with Spark ML
([https://github.com/komiya-atsushi/xgboost-predictor-java/blob/master/xgboost-predictor-spark/src/main/scala/biz/k11i/xgboost/spark/model/XGBoostBinaryClassificationModel.scala]).
Like RandomForestClassificationModel, this model subclasses
ProbabilisticClassificationModel, so it uses the same transform method. I saw
the same pattern of task deserialization and stage execution times as with the
RF model. I then cloned the repo and copied the transform method from
ProbabilisticClassificationModel into XGBoostBinaryClassificationModel, this
time adding a broadcast inside transform. That change brought the execution
time down significantly.

[~srowen] Adding a broadcast to the transform implementation in
ProbabilisticClassificationModel can fix this, i.e. broadcasting the model
instance and calling predictRaw, raw2probability and the other row-level
methods on the broadcast value.
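
To make the proposal concrete, here is a minimal sketch of the pattern, not the
actual Spark change. ToyModel, its predictRaw, and BroadcastTransformSketch are
hypothetical stand-ins invented for illustration; the sketch does not call the
real row-level methods of ProbabilisticClassificationModel. It only shows the
model being broadcast once and the UDF reading it through the broadcast value
instead of capturing the model in the task closure.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for a large model (e.g. a forest of trees).
// Case classes are Serializable, so the instance can be broadcast.
final case class ToyModel(weights: Array[Double]) {
  // Row-level scoring: a plain dot product, for illustration only.
  def predictRaw(features: Vector): Double = {
    var sum = 0.0
    var i = 0
    while (i < weights.length) {
      sum += weights(i) * features(i)
      i += 1
    }
    sum
  }
}

object BroadcastTransformSketch {
  // Broadcast the model once; the UDF closure captures only the broadcast
  // handle, so the model itself is not shipped inside every task.
  def transform(spark: SparkSession, model: ToyModel, df: DataFrame): DataFrame = {
    val bcModel: Broadcast[ToyModel] = spark.sparkContext.broadcast(model)
    val predict = udf { (features: Vector) => bcModel.value.predictRaw(features) }
    df.withColumn("rawPrediction", predict(col("features")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-transform-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1L, Vectors.dense(1.0, 2.0)),
      (2L, Vectors.dense(3.0, 4.0))
    ).toDF("id", "features")

    transform(spark, ToyModel(Array(0.5, -1.0)), df).show()
    spark.stop()
  }
}
```

Whether this removes the deserialization cost in a given job would need to be
confirmed in the Spark UI, as in the comparison described above.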


> RandomForest classification model not using broadcast in transform
> ------------------------------------------------------------------
>
>                 Key: SPARK-21476
>                 URL: https://issues.apache.org/jira/browse/SPARK-21476
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction
> with pipelines that use RandomForestClassificationModel. Digging into the
> source, I found that RandomForestClassificationModel inherits its transform
> method from its parent, ProbabilisticClassificationModel, and that the only
> concrete definition it provides that transform actually uses is predictRaw.
> predictRaw does not use broadcasting.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

