[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645745#comment-16645745 ]
Jungtaek Lim commented on SPARK-10816:
--------------------------------------

Btw, while this issue was inactive, I spent some time comparing both approaches (my patch and Baidu's patch), and I also wrote and ran a simple benchmark.

> Analysis report comparing HWX's patch (my patch) and Baidu's patch

[https://docs.google.com/document/d/1hdh6GNLzprzlSJDDa4UKMyNQ9-u_H-CDpEtvtggxCL0/edit?usp=sharing]

Please note that the report may be biased, since I am the author of HWX's patch, but it should help in understanding the performance difference between my patch and Baidu's patch.

> Benchmark

The benchmark code I wrote is available at the link below.

[https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/8ec98012f1a0d7de7c945198a05322eec151836e]

The test is based on word count sessionization (I adapted the sessionization example in Spark and migrated it to the session window function). It targets the case where the total number of words is limited (around 290) and every batch has a massive number of events for each word. (Of course, there are many data patterns for session windows and I tested only one of them, but a single case is enough to see the performance difference at a glance, and we can add more cases.)

Please refer to the following for how the query is composed and what the test data looks like:

[BaseBenchmarkSessionWindowListenerWordCountSessionFunction.scala|https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/8ec98012f1a0d7de7c945198a05322eec151836e#diff-cdba128ef790fcf9395a04a09cdf5cd9]

[TestSentences.scala|https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/8ec98012f1a0d7de7c945198a05322eec151836e#diff-5759bf22e2d199da550672458a17498b]

Since Baidu's patch supports Complete mode and Append mode, I ran the benchmark in Append mode when comparing HWX's patch with Baidu's patch.
While Baidu's patch cannot keep up with an input rate of 200 (its maximum processed rows per second was around 130), HWX's patch (APPEND mode) can keep up with an input rate of around 23000. (The initial input rate was 1000, but Baidu's patch slowed down badly, consistently logging "Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter".) I expected a large performance difference based on the analysis, though the size of the gap actually surprised me.

I also wrote benchmark code for mapGroupsWithState (similar to the sessionization example) and compared it with HWX's patch. (Since I can't use the same code in this case, the quality of the implementation could affect the overall result. I'm not sure whether my benchmark code is fair or unfavorable to mapGroupsWithState.)

Please refer to [BaseBenchmarkSessionWindowListenerWordCountMapGroupsWithStateUpdateMode.scala|https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/8ec98012f1a0d7de7c945198a05322eec151836e#diff-8f39b0d79d2a05998bab9e317c0cb67f].

While mapGroupsWithState (UPDATE mode) can keep up with an input rate of around 10000, HWX's patch (UPDATE mode) can keep up with an input rate of around 22000. I would really like someone to tune my mapGroupsWithState benchmark code, and I'd be happy to run the benchmark again, but at a glance my patch does not appear to suffer a performance hit.
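For readers unfamiliar with the workload, the word count sessionization used in the benchmark can be sketched in plain Python as follows. This is only an illustrative sketch, not the benchmark code and not the logic of either patch: the function name `sessionize`, the gap of 10 time units, and the sample events are all assumptions made up for the example.

```python
from collections import defaultdict


def sessionize(events, gap):
    """Group (word, event_time) pairs into per-word sessions.

    A word's session is extended while the next event arrives less than
    `gap` time units after the previous one; otherwise a new session starts.
    Returns (word, session_start, session_end, count) tuples, where
    session_end is the last event time in the session plus `gap`.
    """
    # Collect event times per word (hypothetical in-memory stand-in for state).
    by_word = defaultdict(list)
    for word, ts in events:
        by_word[word].append(ts)

    sessions = []
    for word in sorted(by_word):
        times = sorted(by_word[word])
        start = last = times[0]
        count = 1
        for ts in times[1:]:
            if ts - last < gap:
                # Event falls inside the current session: extend it.
                last = ts
                count += 1
            else:
                # Gap exceeded: close the current session, open a new one.
                sessions.append((word, start, last + gap, count))
                start = last = ts
                count = 1
        sessions.append((word, start, last + gap, count))
    return sessions


# Made-up sample data: a few events for two words, gap of 10 time units.
events = [("spark", 0), ("spark", 4), ("spark", 20), ("flink", 3), ("spark", 7)]
result = sessionize(events, gap=10)
for row in result:
    print(row)
```

With few words and many events per word, as in the benchmark, most events extend an existing session rather than open a new one.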
> EventTime based sessionization
> ------------------------------
>
>                 Key: SPARK-10816
>                 URL: https://issues.apache.org/jira/browse/SPARK-10816
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Reynold Xin
>            Priority: Major
>         Attachments: SPARK-10816 Support session window natively.pdf, Session Window Support For Structure Streaming.pdf
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)