[jira] [Commented] (SPARK-10816) EventTime based sessionization

Jungtaek Lim (JIRA) Mon, 29 Oct 2018 21:34:13 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668120#comment-16668120
 ]


Jungtaek Lim commented on SPARK-10816:
--------------------------------------

UPDATE:

1. Added a new data pattern to the benchmark code: plenty of sessions per key

* half of the rows are late events, evenly distributed over the past 7 days (at 1-minute granularity)
* events are distributed across 10 keys
* watermark: late events are allowed for up to 7 days
* session gap: 30 seconds

{code}
    // Late-event timestamp: truncate to the minute, then shift back in 1-minute steps (less than 7 days).
    val calculateLateEventTimestamp =
      "floor(CAST(timestamp AS LONG) / 60) * 60 - (60 * mod(value, 60 * 24 * 7))"

    val events = df.toDF()
      .selectExpr(
        TestSentences.createCaseExprStr("mod(floor(value / 10), 10)", 10) + " as word",
        // Even values keep their original timestamp; odd values become late events.
        "CASE WHEN mod(value, 2) == 0 THEN timestamp " +
          s"ELSE CAST($calculateLateEventTimestamp AS TIMESTAMP) END " +
          "AS eventTime")
      .withWatermark("eventTime", "7 days")
{code}
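
To make the arithmetic in the late-event expression concrete, here is a minimal plain-Scala sketch that mirrors it outside of Spark. The object and method names (LateEventTimestamp, lateTimestamp) are hypothetical and not part of the benchmark code; it assumes timestamps are whole seconds since the epoch and that value is non-negative.

```scala
// Illustration only: mirrors
//   floor(CAST(timestamp AS LONG) / 60) * 60 - (60 * mod(value, 60 * 24 * 7))
// in plain Scala. Names here are hypothetical, not taken from the benchmark.
object LateEventTimestamp {
  // Number of minutes in 7 days.
  val MinutesInSevenDays: Long = 60L * 24 * 7

  // epochSeconds: event time in seconds since the epoch.
  // value: the row's sequence number (assumed non-negative).
  def lateTimestamp(epochSeconds: Long, value: Long): Long = {
    val truncatedToMinute = Math.floorDiv(epochSeconds, 60L) * 60L
    val shiftBackSeconds = 60L * (value % MinutesInSevenDays)
    truncatedToMinute - shiftBackSeconds
  }

  def main(args: Array[String]): Unit = {
    val ts = 1540874053L // arbitrary example timestamp
    // value = 1 shifts the truncated timestamp back by one minute.
    println(lateTimestamp(ts, 1L))
    // value = 10079 gives the largest shift, still less than 7 days.
    println(ts - lateTimestamp(ts, 10079L))
  }
}
```

Because the shift is always smaller than 7 days, the generated late events stay inside the 7-day watermark allowance.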

https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/blob/master/src/main/scala/com/hortonworks/spark/benchmark/streaming/sessionwindow/plenty_of_sessions_in_key/BenchmarkSessionWindowListenerWordCountSessionFunction.scala

2. Worked on leveraging a linked list and on optimizing how previous 
sessions are merged with current rows.

New WIP branch: 
https://github.com/HeartSaVioR/spark/commits/SPARK-10816-WIP-linked-list
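
For readers unfamiliar with the merging step, the following is a rough sketch of how previously stored sessions for one key can be combined with newly arrived event times under a 30-second gap. It is not the code in the WIP branch: the names (SessionMergeSketch, Session, mergeEvents) are hypothetical, and the rule that an event arriving exactly at a session's end starts a new session is an assumption of this sketch.

```scala
// Illustration only; not the implementation in the WIP branch.
object SessionMergeSketch {
  // A session covers [start, end), where end = last event time + gap (seconds).
  final case class Session(start: Long, end: Long, count: Long)

  val GapSeconds: Long = 30L

  // Merge stored sessions with new event times for a single key.
  def mergeEvents(existing: List[Session], eventTimes: Seq[Long]): List[Session] = {
    // Treat every new event as a one-event session.
    val asSessions = eventTimes.map(t => Session(t, t + GapSeconds, 1L))
    val sorted = (existing ++ asSessions).sortBy(_.start)
    // Walk in start order; extend the most recent session when ranges overlap.
    sorted.foldLeft(List.empty[Session]) {
      case (last :: rest, s) if s.start < last.end =>
        Session(last.start, math.max(last.end, s.end), last.count + s.count) :: rest
      case (acc, s) =>
        s :: acc
    }.reverse
  }

  def main(args: Array[String]): Unit = {
    val stored = List(Session(0L, 30L, 1L), Session(100L, 130L, 1L))
    println(mergeEvents(stored, Seq(20L, 45L, 200L)))
  }
}
```

This version re-sorts everything on each call, which keeps it short rather than fast; it is meant only to show the merge rule.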

3. Also ran some benchmarks. Will post the results in a separate comment.

> EventTime based sessionization
> ------------------------------
>
>                 Key: SPARK-10816
>                 URL: https://issues.apache.org/jira/browse/SPARK-10816
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Reynold Xin
>            Priority: Major
>         Attachments: SPARK-10816 Support session window natively.pdf, Session Window Support For Structure Streaming.pdf
>
>




