[
https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-24106.
----------------------------------
Resolution: Incomplete
> Spark Structure Streaming with RF model taking long time in processing
> probability for each mini batch
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-24106
> URL: https://issues.apache.org/jira/browse/SPARK-24106
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topic with 6 partition
> Reporter: Tamilselvan Veeramani
> Priority: Major
> Labels: bulk-closed, performance
>
> RandomForestClassificationModel is broadcast to the executors for every mini
> batch in Spark streaming when computing the probability.
> RF model size: 45 MB
> Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies)
> The following log shows the model being broadcast to the executors for every
> mini batch when we call rf_model.transform(dataset).select("probability").
> Because of this, task deserialization time rises to 6 to 7 seconds for the
> 45 MB RF model, although the processing time is only 400 to 600 ms per mini
> batch.
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range:
> KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
>
> KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
>
> KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
>
> KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
>
> KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
>
> KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
> 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory
> on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
> After 2 to 3 weeks of struggle, I found a potential solution that should
> help the many people who want to use an RF model for “probability” in a
> real-time streaming context.
> In the current version of Spark, the transformImpl method of the
> RandomForestClassificationModel class implements only “prediction”. The same
> approach can be leveraged to implement “probability” in transformImpl as
> well.
> I have modified the code and deployed it on our servers, and each mini batch
> now completes in 400 ms to 500 ms.
> I see many people out there facing this issue, with no solution provided in
> any of the forums. Can you please review this and include the fix in the
> next release? Thanks.
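
For readers unfamiliar with what the "probability" output of a random forest classifier contains, here is a minimal plain-Python sketch of one common way to derive it: each tree contributes the normalized class distribution of the leaf a row lands in, the contributions are summed into a raw vector, and that vector is normalized to sum to 1. This is an illustration under that assumption, not Spark's code and not the reporter's patch. The function names (tree_distribution, forest_probability, predict) and the sample leaf counts are hypothetical.

```python
from typing import List, Sequence


def tree_distribution(leaf_class_counts: Sequence[float]) -> List[float]:
    """Normalize one tree's leaf class counts into a class distribution."""
    total = sum(leaf_class_counts)
    if total <= 0:
        raise ValueError("leaf has no training samples")
    return [count / total for count in leaf_class_counts]


def forest_probability(leaves: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-tree leaf counts into one probability vector.

    `leaves` holds, for a single input row, the class counts of the leaf
    that the row reached in each tree (hypothetical input format).
    """
    if not leaves:
        raise ValueError("forest has no trees")
    num_classes = len(leaves[0])
    raw = [0.0] * num_classes
    for counts in leaves:
        if len(counts) != num_classes:
            raise ValueError("all trees must report the same number of classes")
        # Each tree adds its own normalized distribution to the raw vector.
        for i, p in enumerate(tree_distribution(counts)):
            raw[i] += p
    # Normalize the summed vector so the entries add up to 1.
    total = sum(raw)
    return [value / total for value in raw]


def predict(probability: Sequence[float]) -> int:
    """Return the index of the most probable class."""
    return max(range(len(probability)), key=lambda i: probability[i])


# Hypothetical row: three trees, two classes.
example_leaves = [[3, 1], [1, 1], [0, 4]]
example_probability = forest_probability(example_leaves)
example_prediction = predict(example_probability)
```

With the sample counts above, the per-tree distributions are [0.75, 0.25], [0.5, 0.5] and [0.0, 1.0], so the probability vector is approximately [0.4167, 0.5833] and the predicted class index is 1. The point for this issue is that the prediction and the probability come from the same per-tree traversal, so both could in principle be served from a single model copy shipped to each executor.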