[ https://issues.apache.org/jira/browse/SPARK-21476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101443#comment-16101443 ]
Saurabh Agrawal commented on SPARK-21476:
-----------------------------------------
[[email protected]] My streaming application suffers from increased latency in
the stage where I use this model for prediction. In the Spark UI for this
stage, the task execution time looks fine, but each task spends a large share
of its time in deserialization. When I comment out the line that calls
model.transform, substitute dummy values, and run the same application in the
same environment, the task deserialization time drops and the stage as a whole
runs significantly faster.
I also tried an XGBoost model from a third-party, Spark ML compatible library
([https://github.com/komiya-atsushi/xgboost-predictor-java/blob/master/xgboost-predictor-spark/src/main/scala/biz/k11i/xgboost/spark/model/XGBoostBinaryClassificationModel.scala]).
Just like RandomForestClassificationModel, this model subclasses
ProbabilisticClassificationModel and hence uses the same transform method. I
saw the same task deserialization and stage execution times as with the RF
model. I then cloned the repo and copied the transform method from
ProbabilisticClassificationModel into XGBoostBinaryClassificationModel, this
time adding broadcasting inside transform. This change brought the execution
time down significantly.
[~srowen] Adding a broadcast to the transform implementation in
ProbabilisticClassificationModel could fix this, i.e. broadcasting the model
instance and calling predictRaw, raw2probability and the other row-level
methods on the broadcast value.
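A minimal sketch of the idea, not the actual Spark code or the patch I tested:
ToyModel is a hypothetical stand-in for a trained model, its predictRaw only
mimics the shape of the real row-level method, and the column names
"features" and "rawPrediction" are assumed for illustration. The point is
that the UDF closure captures only the broadcast handle, so the model is
shipped to each executor once instead of being deserialized with every task.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for a trained model; this is not a Spark class.
final case class ToyModel(weights: Array[Double]) {
  // Plays the role of a row-level method such as predictRaw:
  // returns a two-class raw score vector for one feature vector.
  def predictRaw(features: Vector): Vector = {
    val score = features.toArray.zip(weights).map { case (x, w) => x * w }.sum
    Vectors.dense(-score, score)
  }
}

object BroadcastTransformSketch {
  // Broadcast the model once, then call its row-level method through the
  // broadcast value inside the UDF instead of capturing the model directly.
  def transform(spark: SparkSession, model: ToyModel, df: DataFrame): DataFrame = {
    val bcModel: Broadcast[ToyModel] = spark.sparkContext.broadcast(model)
    val predictRawUdf = udf { (features: Vector) => bcModel.value.predictRaw(features) }
    df.withColumn("rawPrediction", predictRawUdf(col("features")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-transform-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two toy rows with a single "features" column.
    val df = Seq(
      Tuple1(Vectors.dense(1.0, 2.0)),
      Tuple1(Vectors.dense(0.5, -1.0))
    ).toDF("features")

    transform(spark, ToyModel(Array(0.3, 0.7)), df).show(false)
    spark.stop()
  }
}
```

In a real model class the same pattern would sit inside transform itself,
with the broadcast wrapping the model instance; how much it helps would
presumably depend on the size of the serialized model.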
> RandomForest classification model not using broadcast in transform
> ------------------------------------------------------------------
>
> Key: SPARK-21476
> URL: https://issues.apache.org/jira/browse/SPARK-21476
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Saurabh Agrawal
>
> I notice significant task deserialization latency while running prediction
> with pipelines using RandomForestClassificationModel. While digging into the
> source, I found that the transform method of RandomForestClassificationModel
> resolves to that of its parent, ProbabilisticClassificationModel, and that
> the only concrete definition RandomForestClassificationModel provides that
> is actually used in transform is predictRaw. Broadcasting is not used in
> predictRaw.