[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645745#comment-16645745
 ] 

Jungtaek Lim commented on SPARK-10816:
--------------------------------------

Btw, while there was some inactivity on this issue, I've spent some time 
comparing both approaches (my patch and Baidu's patch), as well as crafting and 
running a simple benchmark. 

 

> Analysis report on comparing HWX's patch (my patch) and Baidu's patch.

[https://docs.google.com/document/d/1hdh6GNLzprzlSJDDa4UKMyNQ9-u_H-CDpEtvtggxCL0/edit?usp=sharing]

Please note that the report may be biased, since HWX's patch is authored by 
myself, but it should still help in understanding the performance difference 
between my patch and Baidu's patch.

 

> Benchmark

 

You can find the benchmark code I’ve crafted at the link below.

[https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/8ec98012f1a0d7de7c945198a05322eec151836e]

 

The test is based on word count sessionization (it adopts the sessionization 
example in Spark and migrates it to the session window function), and it targets 
the case where the total number of words is limited (around 290) and every batch 
has massive numbers of events for each word.

(Sure, there are many data patterns for session windows, and what I tested was 
only one of them, but we can see the performance difference at a glance from a 
single case, and we can add more cases.)

Please refer to the files below for how the query is composed as well as what the test data looks like:

[BaseBenchmarkSessionWindowListenerWordCountSessionFunction.scala|https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/8ec98012f1a0d7de7c945198a05322eec151836e#diff-cdba128ef790fcf9395a04a09cdf5cd9]

[TestSentences.scala|https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/8ec98012f1a0d7de7c945198a05322eec151836e#diff-5759bf22e2d199da550672458a17498b]
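For readers who don’t want to open the links, the core sessionization the test performs can be sketched as a pure function. This is a hedged illustration only (the `Session` type and `sessionize` helper are mine, not code from either patch or from the benchmark): it merges a key’s event times into session windows using a gap duration, which is what a session window function conceptually computes per key.

```scala
// Hypothetical sketch, not the patches' actual code: merge one key's event
// timestamps (millis) into session windows separated by a gap duration.
case class Session(start: Long, end: Long, count: Int)

def sessionize(eventTimes: Seq[Long], gapMs: Long): Seq[Session] =
  eventTimes.sorted.foldLeft(List.empty[Session]) {
    case (last :: rest, t) if t <= last.end + gapMs =>
      // Event arrives within the gap of the current session: extend it.
      last.copy(end = math.max(last.end, t), count = last.count + 1) :: rest
    case (acc, t) =>
      // Gap exceeded (or first event): start a new session.
      Session(t, t, 1) :: acc
  }.reverse
```

For example, with a 5 second gap, events at 0 and 1000 fall into one session while an event at 10000 opens a second one.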

 

Since Baidu’s patch supports Complete mode and Append mode, I ran the benchmark 
in Append mode when comparing HWX’s patch with Baidu’s patch.

While Baidu’s patch cannot keep up with an input rate of 200 (it showed a 
maximum processed rate of around 130 rows per second), HWX’s patch (APPEND mode) 
can keep up with an input rate of around 23000.

(The initial input rate was 1000, but Baidu’s patch slowed down badly, 
consistently logging "Reached spill threshold of 4096 rows, switching to 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter".)

For me, a large performance difference was expected based on the analysis, 
though the size of the gap actually surprised me.

 

 

I’ve also crafted benchmark code for mapGroupsWithState (similar to the 
sessionization example) and compared it with HWX’s patch.

(Since I can't use the same code when testing this case, the quality of the 
implementation could affect the overall result. I’m not sure whether the 
benchmark code I crafted is fair, or unfavorable, to mapGroupsWithState.)

Please refer to 
[BaseBenchmarkSessionWindowListenerWordCountMapGroupsWithStateUpdateMode.scala|https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/8ec98012f1a0d7de7c945198a05322eec151836e#diff-8f39b0d79d2a05998bab9e317c0cb67f]
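To give a sense of what the mapGroupsWithState side has to do by hand, here is a hedged sketch of a per-key state update step such a sessionization might perform. The `SessionState` type and `updateState` helper are illustrative names of my own, not code from the benchmark: a real query would wire this logic into `mapGroupsWithState` via a `GroupState` and emit the closed session before replacing the state.

```scala
// Hypothetical per-key update step for a mapGroupsWithState-style
// sessionization: fold a batch of new event times (millis) into the
// existing session state, keeping the session open while events stay
// within the gap.
case class SessionState(start: Long, lastSeen: Long, events: Int)

def updateState(state: Option[SessionState],
                newEvents: Seq[Long],
                gapMs: Long): Option[SessionState] =
  newEvents.sorted.foldLeft(state) {
    case (Some(s), t) if t <= s.lastSeen + gapMs =>
      // Still within the gap: extend the current session.
      Some(s.copy(lastSeen = math.max(s.lastSeen, t), events = s.events + 1))
    case (_, t) =>
      // Gap exceeded (or no prior state): in a real query the closed
      // session would be emitted here; this sketch just starts fresh.
      Some(SessionState(t, t, 1))
  }
```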

While mapGroupsWithState (UPDATE mode) can keep up with an input rate of around 
10000, HWX’s patch (UPDATE mode) can keep up with an input rate of around 22000.

I would really like to see someone tune my mapGroupsWithState benchmark code, 
and I'd be happy to run the benchmark again, but at a glance it looks like my 
patch is not suffering a performance hit.

 

> EventTime based sessionization
> ------------------------------
>
>                 Key: SPARK-10816
>                 URL: https://issues.apache.org/jira/browse/SPARK-10816
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Reynold Xin
>            Priority: Major
>         Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org