adragomir opened a new issue, #14540:
URL: https://github.com/apache/datafusion/issues/14540
### Describe the bug
With the following table structure:
```
CREATE OR REPLACE TABLE t1 (
date DATE,
timestamp TIMESTAMP_S,
ids STRUCT(
id1 VARCHAR,
extra INT4
),
struct STRUCT(
var1 VARCHAR,
extra VARCHAR
)
);
```
and the following query
```
WITH events AS (
SELECT
ids.id1 as device,
struct.var1 as user,
timestamp
FROM t1
WHERE
date='2025-01-03'
)
SELECT
*,
LAG(user, 1) OVER (PARTITION BY device ORDER BY timestamp) AS
prev
FROM events
WHERE
device IS NOT NULL AND device != ''
AND user IS NOT NULL AND user != ''
LIMIT 100
```
I get into a situation where the 2 optimization steps given above fight
between them:
1. First, push_down_filter pushes the filter down
```
Projection: events.device, events.user, events.timestamp,
lag(events.user,Int64(1)) PARTITION BY [events.device] ORDER BY
[events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW AS prev
Limit: skip=0, fetch=100
WindowAggr: windowExpr=[[lag(events.user, Int64(1)) PARTITION BY
[events.device] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW]]
SubqueryAlias: events
Projection: get_field(t1.ids, Utf8("id1")) AS device,
get_field(t1.struct, Utf8("var1")) AS user, t1.timestamp
Filter: get_field(t1.ids, Utf8("id1")) IS NOT NULL AND
get_field(t1.ids, Utf8("id1")) != Utf8("") AND get_field(t1.struct,
Utf8("var1")) IS NOT NULL AND get_field(t1.struct, Utf8("var1")) != Utf8("")
AND t1._ACP_DATE = Date32("2025-01-03")
TableScan: t1, partial_filters=[get_field(t1.ids,
Utf8("id1")) IS NOT NULL, get_field(t1.ids, Utf8("id1")) != Utf8(""),
get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct,
Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")]
```
2. Then the common_sub_expression_eliminate acts on the `Filter`, and of
course, adds a `Projection` below it to contain the aliases:
```
Projection: events.DeviceId, events.UserId, events.timestamp,
lag(events.UserId,Int64(1)) PARTITION BY [events.DeviceId] ORDER BY
[events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW AS PreviousUserColName
Limit: skip=0, fetch=100
WindowAggr: windowExpr=[[lag(events.UserId, Int64(1)) PARTITION BY
[events.DeviceId] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW]]
SubqueryAlias: events
Projection: get_field(t1.endUserIDs, Utf8("id1")) AS DeviceId,
get_field(t1.struct, Utf8("var1")) AS UserId, t1.timestamp
Filter: __common_expr_1 IS NOT NULL AND __common_expr_1 !=
Utf8("") AND get_field(t1.struct, Utf8("var1")) IS NOT NULL AND
get_field(t1.struct, Utf8("var1")) != Utf8("") AND t1._ACP_DATE =
Date32("2025-01-03")
Projection: get_field(t1.endUserIDs, Utf8("id1")) AS
__common_expr_1, t1._ACP_DATE, t1.timestamp, t1.endUserIDs, t1.struct
TableScan: t1 projection=[_ACP_DATE, timestamp,
endUserIDs, struct], partial_filters=[get_field(t1.endUserIDs, Utf8("id1")) IS
NOT NULL, get_field(t1.endUserIDs, Utf8("id1")) != Utf8(""),
get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct,
Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")]
```
3. At the next run, `push_down_filter` sees a Filter on top a Projection and
pushes it down, deleting the alias probably because Filters can't have aliases
```
Projection: events.DeviceId, events.UserId, events.timestamp,
lag(events.UserId,Int64(1)) PARTITION BY [events.DeviceId] ORDER BY
[events.timestamp ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW AS PreviousUserColName
Limit: skip=0, fetch=100
WindowAggr: windowExpr=[[lag(events.UserId, Int64(1)) PARTITION BY
[events.DeviceId] ORDER BY [events.timestamp ASC NULLS LAST] RANGE BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW]]
SubqueryAlias: events
Projection: get_field(t1.endUserIDs, Utf8("id1")) AS DeviceId,
get_field(t1.struct, Utf8("var1")) AS UserId, t1.timestamp
Projection: get_field(t1.endUserIDs, Utf8("id1")) AS
__common_expr_1, t1._ACP_DATE, t1.timestamp, t1.endUserIDs, t1.struct
Filter: get_field(t1.endUserIDs, Utf8("id1")) IS NOT NULL
AND get_field(t1.endUserIDs, Utf8("id1")) != Utf8("") AND get_field(t1.struct,
Utf8("var1")) IS NOT NULL AND get_field(t1.struct, Utf8("var1")) != Utf8("")
AND t1._ACP_DATE = Date32("2025-01-03")
TableScan: t1 projection=[_ACP_DATE, timestamp,
endUserIDs, struct], partial_filters=[get_field(t1.endUserIDs, Utf8("id1")) IS
NOT NULL, get_field(t1.endUserIDs, Utf8("id1")) != Utf8(""),
get_field(t1.struct, Utf8("var1")) IS NOT NULL, get_field(t1.struct,
Utf8("var1")) != Utf8(""), t1._ACP_DATE = Date32("2025-01-03")]
```
And no progress is made.
This seems like it could happen in other cases. Maybe we could have
something like hints, like a flag on the logical plan that can say that the
plan was added during a previous optimization step ? The problem seems
complicated though.
### To Reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]