[
https://issues.apache.org/jira/browse/SPARK-46450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799140#comment-17799140
]
Juan Pumarino commented on SPARK-46450:
---------------------------------------
[~kabhwan] thanks for the explanation; I learned a bit more about how Spark
internals work.
> session_window doesn't identify sessions with provided gap when used as a
> window function
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-46450
> URL: https://issues.apache.org/jira/browse/SPARK-46450
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1, 3.5.0
> Reporter: Juan Pumarino
> Priority: Minor
>
> {{PARTITION BY session_window}} doesn't produce the expected results. Here's
> an example:
> {code:sql}
> SELECT
> id,
> ts,
> collect_list(id) OVER (PARTITION BY session_window(ts, '1 hour')) as
> window_ids
> FROM VALUES
> (1, "2023-12-11 01:10"),
> (2, "2023-12-11 01:15"),
> (3, "2023-12-11 01:40"),
> (4, "2023-12-11 02:05"),
> (5, "2023-12-11 03:15"),
> (6, "2023-12-11 03:20"),
> (7, "2023-12-11 04:10"),
> (8, "2023-12-11 05:05")
> AS tab(id, ts)
> {code}
> Actual result:
> {code:java}
> +---+----------------+----------+
> |id |ts |window_ids|
> +---+----------------+----------+
> |1 |2023-12-11 01:10|[1] |
> |2 |2023-12-11 01:15|[2] |
> |3 |2023-12-11 01:40|[3] |
> |4 |2023-12-11 02:05|[4] |
> |5 |2023-12-11 03:15|[5] |
> |6 |2023-12-11 03:20|[6] |
> |7 |2023-12-11 04:10|[7] |
> |8 |2023-12-11 05:05|[8] |
> +---+----------------+----------+
> {code}
> Expected result, assigning rows to two sessions with 1-hour gap:
> {code:java}
> +---+----------------+------------+
> |id |ts |window_ids |
> +---+----------------+------------+
> |1 |2023-12-11 01:10|[1, 2, 3, 4]|
> |2 |2023-12-11 01:15|[1, 2, 3, 4]|
> |3 |2023-12-11 01:40|[1, 2, 3, 4]|
> |4 |2023-12-11 02:05|[1, 2, 3, 4]|
> |5 |2023-12-11 03:15|[5, 6, 7, 8]|
> |6 |2023-12-11 03:20|[5, 6, 7, 8]|
> |7 |2023-12-11 04:10|[5, 6, 7, 8]|
> |8 |2023-12-11 05:05|[5, 6, 7, 8]|
> +---+----------------+------------+
> {code}
> I compared its behavior with the results as a grouping function and with how
> {{window()}} behaves in both cases, which seems to confirm that the result is
> inconsistent. Here are the other examples:
> *{{group by window()}}*
> {code:sql}
> SELECT
> collect_list(id) AS ids,
> collect_list(ts) AS tss,
> window
> FROM VALUES
> (1, "2023-12-11 01:10"),
> (2, "2023-12-11 01:15"),
> (3, "2023-12-11 01:40"),
> (4, "2023-12-11 02:05"),
> (5, "2023-12-11 03:15"),
> (6, "2023-12-11 03:20"),
> (7, "2023-12-11 04:10"),
> (8, "2023-12-11 05:05")
> AS tab(id, ts)
> GROUP by window(ts, '1 hour')
> {code}
> Correctly assigns rows to 1-hour windows:
> {code:java}
> +---------+------------------------------------------------------+------------------------------------------+
> |ids |tss |window
> |
> +---------+------------------------------------------------------+------------------------------------------+
> |[1, 2, 3]|[2023-12-11 01:10, 2023-12-11 01:15, 2023-12-11 01:40]|{2023-12-11
> 01:00:00, 2023-12-11 02:00:00}|
> |[4] |[2023-12-11 02:05] |{2023-12-11
> 02:00:00, 2023-12-11 03:00:00}|
> |[5, 6] |[2023-12-11 03:15, 2023-12-11 03:20] |{2023-12-11
> 03:00:00, 2023-12-11 04:00:00}|
> |[7] |[2023-12-11 04:10] |{2023-12-11
> 04:00:00, 2023-12-11 05:00:00}|
> |[8] |[2023-12-11 05:05] |{2023-12-11
> 05:00:00, 2023-12-11 06:00:00}|
> +---------+------------------------------------------------------+------------------------------------------+
> {code}
>
> *{{group by session_window()}}*
> {code:sql}
> SELECT
> collect_list(id) AS ids,
> collect_list(ts) AS tss,
> session_window
> FROM VALUES
> (1, "2023-12-11 01:10"),
> (2, "2023-12-11 01:15"),
> (3, "2023-12-11 01:40"),
> (4, "2023-12-11 02:05"),
> (5, "2023-12-11 03:15"),
> (6, "2023-12-11 03:20"),
> (7, "2023-12-11 04:10"),
> (8, "2023-12-11 05:05")
> AS tab(id, ts)
> GROUP by session_window(ts, '1 hour')
> {code}
> Correctly assigns rows to two sessions with 1-hour gap:
> {code:java}
> +------------+------------------------------------------------------------------------+------------------------------------------+
> |ids |tss
> |session_window |
> +------------+------------------------------------------------------------------------+------------------------------------------+
> |[1, 2, 3, 4]|[2023-12-11 01:10, 2023-12-11 01:15, 2023-12-11 01:40,
> 2023-12-11 02:05]|{2023-12-11 01:10:00, 2023-12-11 03:05:00}|
> |[5, 6, 7, 8]|[2023-12-11 03:15, 2023-12-11 03:20, 2023-12-11 04:10,
> 2023-12-11 05:05]|{2023-12-11 03:15:00, 2023-12-11 06:05:00}|
> +------------+------------------------------------------------------------------------+------------------------------------------+
> {code}
>
> *{{partition by window()}}*
> {code:sql}
> SELECT
> id,
> ts,
> collect_list(id) OVER (PARTITION BY window(ts, '1 hour')) as window_ids
> FROM VALUES
> (1, "2023-12-11 01:10"),
> (2, "2023-12-11 01:15"),
> (3, "2023-12-11 01:40"),
> (4, "2023-12-11 02:05"),
> (5, "2023-12-11 03:15"),
> (6, "2023-12-11 03:20"),
> (7, "2023-12-11 04:10"),
> (8, "2023-12-11 05:05")
> AS tab(id, ts)
> {code}
> Correctly assigns rows to 1-hour windows:
> {code:java}
> +---+----------------+----------+
> |id |ts |window_ids|
> +---+----------------+----------+
> |1 |2023-12-11 01:10|[1, 2, 3] |
> |2 |2023-12-11 01:15|[1, 2, 3] |
> |3 |2023-12-11 01:40|[1, 2, 3] |
> |4 |2023-12-11 02:05|[4] |
> |5 |2023-12-11 03:15|[5, 6] |
> |6 |2023-12-11 03:20|[5, 6] |
> |7 |2023-12-11 04:10|[7] |
> |8 |2023-12-11 05:05|[8] |
> +---+----------------+----------+
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]