wiedld opened a new issue, #9870:
URL: https://github.com/apache/arrow-datafusion/issues/9870
### Describe the bug
The common-expression-elimination optimizer errors when there are two
`aggr_fn(<short_circuit>)` projections. Specifically, we have a reproducer with
a failure when running:
```
statement ok
CREATE TABLE t1(
time TIMESTAMP,
load1 DOUBLE,
load2 DOUBLE,
host VARCHAR
) AS VALUES
(to_timestamp_nanos(1527018806000000000), 1.1, 101, 'host1'),
(to_timestamp_nanos(1527018806000000000), 2.2, 202, 'host2'),
(to_timestamp_nanos(1527018806000000000), 3.3, 303, 'host3'),
(to_timestamp_nanos(1527018806000000000), 1.1, 101, NULL)
;
# cannot have 2 projections with aggr(short_circuited), with the same
short-circuited expr (e.g. CASE WHEN)
statement error DataFusion error: Optimizer rule
'common_sub_expression_eliminate' failed
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host,
sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null
then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c1']), sum((case when
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then
t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select
struct(time,load1,load2,host) from t1) t2 where
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
```
We isolated this error to how the
[IdArray](https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/datafusion/optimizer/src/common_subexpr_eliminate.rs#L68)
is generated and used. The IdArray is generated by
[insertion-at-index](https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/datafusion/optimizer/src/common_subexpr_eliminate.rs#L692)
during traversal of a first visitor. This IdArray is then [read during the
second visitor, also by an
index](https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/datafusion/optimizer/src/common_subexpr_eliminate.rs#L754),
but this index is incremented differently as this second visitor traverses.
As such, the second visitor finds the wrong expr, and then inserts the wrong
expression. We could make small changes to fix our bug, but then broke other
statements. We feel that the index-based lookup may be inherently fragile to
slight changes in traversal patterns across the two visitors; as such, we have
a proposed alternative which will be up in PR shortly.
### To Reproduce
Full reproducer:
```
statement ok
CREATE TABLE t1(
time TIMESTAMP,
load1 DOUBLE,
load2 DOUBLE,
host VARCHAR
) AS VALUES
(to_timestamp_nanos(1527018806000000000), 1.1, 101, 'host1'),
(to_timestamp_nanos(1527018806000000000), 2.2, 202, 'host2'),
(to_timestamp_nanos(1527018806000000000), 3.3, 303, 'host3'),
(to_timestamp_nanos(1527018806000000000), 1.1, 101, NULL)
;
# struct scalar function with columns
query ?
select struct(time,load1,load2,host) from t1;
----
{c0: 2018-05-22T19:53:26, c1: 1.1, c2: 101.0, c3: host1}
{c0: 2018-05-22T19:53:26, c1: 2.2, c2: 202.0, c3: host2}
{c0: 2018-05-22T19:53:26, c1: 3.3, c2: 303.0, c3: host3}
{c0: 2018-05-22T19:53:26, c1: 1.1, c2: 101.0, c3: }
# can have an aggregate function with an inner coalesce
query TR
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host,
sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']) from
(select struct(time,load1,load2,host) from t1) t2 where
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
----
host1 1.1
host2 2.2
host3 3.3
# can have an aggregate function with an inner CASE WHEN
query TR
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host,
sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null
then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select
struct(time,load1,load2,host) from t1) t2 where
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
----
host1 101
host2 202
host3 303
# can have 2 projections with aggr(short_circuited), with different
short-circuited expr
query TRR
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host,
sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']), sum((case
when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then
t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select
struct(time,load1,load2,host) from t1) t2 where
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
----
host1 1.1 101
host2 2.2 202
host3 3.3 303
# can have 2 projections with aggr(short_circuited), with the same
short-circuited expr (e.g. CASE WHEN)
statement error DataFusion error: Optimizer rule
'common_sub_expression_eliminate' failed
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host,
sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null
then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c1']), sum((case when
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then
t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select
struct(time,load1,load2,host) from t1) t2 where
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
# can have 2 projections with aggr(short_circuited), with the same
short-circuited expr (e.g. coalesce)
statement error DataFusion error: Optimizer rule
'common_sub_expression_eliminate' failed
select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host,
sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']),
sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c2']) from
(select struct(time,load1,load2,host) from t1) t2 where
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
```
### Expected behavior
Should succeed on all of the provided test cases. Should not lookup with
wrong expr, and insert the wrong expression, during the
common-expression-elimination re-writer.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]