[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668120#comment-16668120 ]
Jungtaek Lim commented on SPARK-10816:
--------------------------------------
UPDATE:
1. Added a new data pattern to the benchmark code: plenty of sessions per key
* half of the rows are late events, evenly distributed over the past 7 days (at 1-minute granularity)
* events are distributed across 10 keys
* watermark: late events are allowed for up to 7 days
* session gap: 30 seconds
{code}
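// Late rows: floor the timestamp to the minute, then shift it back by
// (value mod 10080) minutes, i.e. by up to 7 days.
val calculateLateEventTimestamp =
  "floor(CAST(timestamp AS LONG) / 60) * 60 - (60 * mod(value, 60 * 24 * 7))"

val events = df.toDF()
  .selectExpr(
    // 10 keys, chosen by mod(floor(value / 10), 10)
    TestSentences.createCaseExprStr("mod(floor(value / 10), 10)", 10) + " as word",
    // even values keep their timestamp; odd values become late events
    "CASE WHEN mod(value, 2) == 0 THEN timestamp " +
      s"ELSE CAST($calculateLateEventTimestamp AS TIMESTAMP) END " +
      "AS eventTime")
  .withWatermark("eventTime", "7 days")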
{code}
https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/blob/master/src/main/scala/com/hortonworks/spark/benchmark/streaming/sessionwindow/plenty_of_sessions_in_key/BenchmarkSessionWindowListenerWordCountSessionFunction.scala
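For a quick sanity check of the expressions above, the per-row arithmetic can be sketched in plain Scala without Spark. This is only an illustration, not part of the benchmark: the object and method names are made up here, times are taken as whole epoch seconds, and value is assumed to be non-negative.
{code}
object LateEventSketch {
  // 60 * 24 * 7 = 10080 minutes in 7 days
  val MinutesPerWeek: Long = 60L * 24 * 7

  // Mirrors mod(floor(value / 10), 10): index of the key a row is assigned to (0 to 9)
  def keyIndex(value: Long): Long =
    Math.floorMod(Math.floorDiv(value, 10L), 10L)

  // Mirrors the CASE expression: even values keep their timestamp, odd values
  // are floored to the minute and shifted back by (value mod 10080) minutes.
  def eventTimeSeconds(value: Long, timestampSeconds: Long): Long =
    if (value % 2 == 0) timestampSeconds
    else Math.floorDiv(timestampSeconds, 60L) * 60L -
      60L * Math.floorMod(value, MinutesPerWeek)
}
{code}
For example, with value = 3 and a timestamp of 1000 seconds, the row is late: the timestamp is floored to 960 and moved back 3 minutes to 780.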
2. Worked on leveraging a linked list and optimizing how previous sessions are
integrated with current rows.
New WIP branch:
https://github.com/HeartSaVioR/spark/commits/SPARK-10816-WIP-linked-list
3. Also ran some benchmarks; will post the results in a separate comment.
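To illustrate what integrating previous sessions with current rows means in item 2, here is a minimal plain-Scala sketch. It is not the code in the WIP branch and does not reproduce the linked-list optimization. Session, mergeSessions, and the rule that two sessions merge when their half-open ranges overlap are assumptions made for this example; only the 30 second gap comes from the benchmark settings.
{code}
object SessionMergeSketch {
  // A session covers [start, end) in epoch seconds and counts its events.
  final case class Session(start: Long, end: Long, count: Long)

  // Session gap used in the benchmark
  val GapSeconds: Long = 30L

  // Merge previously stored sessions of one key with newly arrived event times.
  def mergeSessions(previous: Seq[Session], newEventTimes: Seq[Long]): List[Session] = {
    // Each new event starts as its own session [t, t + gap)
    val fromEvents = newEventTimes.map(t => Session(t, t + GapSeconds, 1L))
    val sorted = (previous ++ fromEvents).sortBy(_.start)

    // Walk in start order and fold overlapping sessions into the latest one
    sorted.foldLeft(List.empty[Session]) {
      case (last :: rest, s) if s.start < last.end =>
        Session(last.start, math.max(last.end, s.end), last.count + s.count) :: rest
      case (acc, s) =>
        s :: acc
    }.reverse
  }
}
{code}
With a stored session Session(0, 40, 2) and new events at 35, 100 and 120 seconds, the event at 35 extends the stored session to Session(0, 65, 3), and the events at 100 and 120 form a second session Session(100, 150, 2).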
> EventTime based sessionization
> ------------------------------
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Reynold Xin
> Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf, Session
> Window Support For Structure Streaming.pdf
>
>