Juan Pumarino created SPARK-46450:
-------------------------------------
Summary: session_window doesn't identify sessions with provided
gap when used as a window function
Key: SPARK-46450
URL: https://issues.apache.org/jira/browse/SPARK-46450
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.5.0, 3.4.1
Reporter: Juan Pumarino
{{PARTITION BY session_window}} doesn't produce the expected results. Here's an
example:
{code:sql}
SELECT
id,
ts,
collect_list(id) OVER (PARTITION BY session_window(ts, '1 hour')) as
window_ids
FROM VALUES
(1, "2023-12-11 01:10"),
(2, "2023-12-11 01:15"),
(3, "2023-12-11 01:40"),
(4, "2023-12-11 02:05"),
(5, "2023-12-11 03:15"),
(6, "2023-12-11 03:20"),
(7, "2023-12-11 04:10"),
(8, "2023-12-11 05:05")
AS tab(id, ts)
{code}
Actual result:
{code:java}
+---+----------------+----------+
|id |ts |window_ids|
+---+----------------+----------+
|1 |2023-12-11 01:10|[1] |
|2 |2023-12-11 01:15|[2] |
|3 |2023-12-11 01:40|[3] |
|4 |2023-12-11 02:05|[4] |
|5 |2023-12-11 03:15|[5] |
|6 |2023-12-11 03:20|[6] |
|7 |2023-12-11 04:10|[7] |
|8 |2023-12-11 05:05|[8] |
+---+----------------+----------+
{code}
Expected result, assigning rows to two sessions with 1-hour gap:
{code:java}
+---+----------------+------------+
|id |ts |window_ids |
+---+----------------+------------+
|1 |2023-12-11 01:10|[1, 2, 3, 4]|
|2 |2023-12-11 01:15|[1, 2, 3, 4]|
|3 |2023-12-11 01:40|[1, 2, 3, 4]|
|4 |2023-12-11 02:05|[1, 2, 3, 4]|
|5 |2023-12-11 03:15|[5, 6, 7, 8]|
|6 |2023-12-11 03:20|[5, 6, 7, 8]|
|7 |2023-12-11 04:10|[5, 6, 7, 8]|
|8 |2023-12-11 05:05|[5, 6, 7, 8]|
+---+----------------+------------+
{code}
I compared its behavior with the results as a grouping function and with how
{{window()}} behaves in both cases, which seems to confirm that the result is
inconsistent. Here are the other examples:
*{{group by window()}}*
{code:sql}
SELECT
collect_list(id) AS ids,
collect_list(ts) AS tss,
window
FROM VALUES
(1, "2023-12-11 01:10"),
(2, "2023-12-11 01:15"),
(3, "2023-12-11 01:40"),
(4, "2023-12-11 02:05"),
(5, "2023-12-11 03:15"),
(6, "2023-12-11 03:20"),
(7, "2023-12-11 04:10"),
(8, "2023-12-11 05:05")
AS tab(id, ts)
GROUP by window(ts, '1 hour')
{code}
Correctly assigns rows to 1-hour windows:
{code:java}
+---------+------------------------------------------------------+------------------------------------------+
|ids |tss |window
|
+---------+------------------------------------------------------+------------------------------------------+
|[1, 2, 3]|[2023-12-11 01:10, 2023-12-11 01:15, 2023-12-11 01:40]|{2023-12-11
01:00:00, 2023-12-11 02:00:00}|
|[4] |[2023-12-11 02:05] |{2023-12-11
02:00:00, 2023-12-11 03:00:00}|
|[5, 6] |[2023-12-11 03:15, 2023-12-11 03:20] |{2023-12-11
03:00:00, 2023-12-11 04:00:00}|
|[7] |[2023-12-11 04:10] |{2023-12-11
04:00:00, 2023-12-11 05:00:00}|
|[8] |[2023-12-11 05:05] |{2023-12-11
05:00:00, 2023-12-11 06:00:00}|
+---------+------------------------------------------------------+------------------------------------------+
{code}
*{{group by session_window()}}*
{code:sql}
SELECT
collect_list(id) AS ids,
collect_list(ts) AS tss,
session_window
FROM VALUES
(1, "2023-12-11 01:10"),
(2, "2023-12-11 01:15"),
(3, "2023-12-11 01:40"),
(4, "2023-12-11 02:05"),
(5, "2023-12-11 03:15"),
(6, "2023-12-11 03:20"),
(7, "2023-12-11 04:10"),
(8, "2023-12-11 05:05")
AS tab(id, ts)
GROUP by session_window(ts, '1 hour')
{code}
Correctly assigns rows to two sessions with 1-hour gap:
{code:java}
+------------+------------------------------------------------------------------------+------------------------------------------+
|ids |tss
|session_window |
+------------+------------------------------------------------------------------------+------------------------------------------+
|[1, 2, 3, 4]|[2023-12-11 01:10, 2023-12-11 01:15, 2023-12-11 01:40, 2023-12-11
02:05]|{2023-12-11 01:10:00, 2023-12-11 03:05:00}|
|[5, 6, 7, 8]|[2023-12-11 03:15, 2023-12-11 03:20, 2023-12-11 04:10, 2023-12-11
05:05]|{2023-12-11 03:15:00, 2023-12-11 06:05:00}|
+------------+------------------------------------------------------------------------+------------------------------------------+
{code}
*{{partition by window()}}*
{code:sql}
SELECT
id,
ts,
collect_list(id) OVER (PARTITION BY window(ts, '1 hour')) as window_ids
FROM VALUES
(1, "2023-12-11 01:10"),
(2, "2023-12-11 01:15"),
(3, "2023-12-11 01:40"),
(4, "2023-12-11 02:05"),
(5, "2023-12-11 03:15"),
(6, "2023-12-11 03:20"),
(7, "2023-12-11 04:10"),
(8, "2023-12-11 05:05")
AS tab(id, ts)
{code}
Correctly assigns rows to 1-hour windows:
{code:java}
+---+----------------+----------+
|id |ts |window_ids|
+---+----------------+----------+
|1 |2023-12-11 01:10|[1, 2, 3] |
|2 |2023-12-11 01:15|[1, 2, 3] |
|3 |2023-12-11 01:40|[1, 2, 3] |
|4 |2023-12-11 02:05|[4] |
|5 |2023-12-11 03:15|[5, 6] |
|6 |2023-12-11 03:20|[5, 6] |
|7 |2023-12-11 04:10|[7] |
|8 |2023-12-11 05:05|[8] |
+---+----------------+----------+
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]