wiedld opened a new issue, #9870:
URL: https://github.com/apache/arrow-datafusion/issues/9870

   ### Describe the bug
   
   The common-expression-elimination optimizer errors when there are two 
`aggr_fn(<short_circuit>)` projections. Specifically, we have a reproducer with 
a failure when running:
   
   ```
   statement ok
   CREATE TABLE t1(
       time TIMESTAMP,
       load1 DOUBLE,
       load2 DOUBLE,
       host VARCHAR
   ) AS VALUES
     (to_timestamp_nanos(1527018806000000000), 1.1, 101, 'host1'),
     (to_timestamp_nanos(1527018806000000000), 2.2, 202, 'host2'),
     (to_timestamp_nanos(1527018806000000000), 3.3, 303, 'host3'),
     (to_timestamp_nanos(1527018806000000000), 1.1, 101, NULL)
   ;
   
   # cannot have 2 projections with aggr(short_circuited), with the same 
short-circuited expr (e.g. CASE WHEN)
   statement error DataFusion error: Optimizer rule 
'common_sub_expression_eliminate' failed
   select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, 
sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null 
then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c1']), sum((case when 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select 
struct(time,load1,load2,host) from t1) t2 where 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
   ```
   
   We isolated this error to how the 
[IdArray](https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/datafusion/optimizer/src/common_subexpr_eliminate.rs#L68)
 is generated and used. The IdArray is generated by 
[insertion-at-index](https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/datafusion/optimizer/src/common_subexpr_eliminate.rs#L692)
 during traversal of a first visitor. This IdArray is then [read during the 
second visitor, also by an 
index](https://github.com/apache/arrow-datafusion/blob/179179c0b719a7f9e33d138ab728fdc2b0e1e1d8/datafusion/optimizer/src/common_subexpr_eliminate.rs#L754),
 but this index is incremented differently as this second visitor traverses.
   
   As such, the second visitor finds the wrong expr, and then inserts the wrong 
expression. We could make small changes to fix our bug, but then broke other 
statements. We feel that the index-based lookup may be inherently fragile to 
slight changes in traversal patterns across the two visitors; as such, we have 
a proposed alternative which will be up in PR shortly.
   
   ### To Reproduce
   
   Full reproducer:
   ```
   statement ok
   CREATE TABLE t1(
       time TIMESTAMP,
       load1 DOUBLE,
       load2 DOUBLE,
       host VARCHAR
   ) AS VALUES
     (to_timestamp_nanos(1527018806000000000), 1.1, 101, 'host1'),
     (to_timestamp_nanos(1527018806000000000), 2.2, 202, 'host2'),
     (to_timestamp_nanos(1527018806000000000), 3.3, 303, 'host3'),
     (to_timestamp_nanos(1527018806000000000), 1.1, 101, NULL)
   ;
   
   # struct scalar function with columns
   query ?
   select struct(time,load1,load2,host) from t1;
   ----
   {c0: 2018-05-22T19:53:26, c1: 1.1, c2: 101.0, c3: host1}
   {c0: 2018-05-22T19:53:26, c1: 2.2, c2: 202.0, c3: host2}
   {c0: 2018-05-22T19:53:26, c1: 3.3, c2: 303.0, c3: host3}
   {c0: 2018-05-22T19:53:26, c1: 1.1, c2: 101.0, c3: }
   
   # can have an aggregate function with an inner coalesce
   query TR
   select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, 
sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']) from 
(select struct(time,load1,load2,host) from t1) t2 where 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
   ----
   host1 1.1
   host2 2.2
   host3 3.3
   
   # can have an aggregate function with an inner CASE WHEN
   query TR
   select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, 
sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null 
then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select 
struct(time,load1,load2,host) from t1) t2 where 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
   ----
   host1 101
   host2 202
   host3 303
   
   # can have 2 projections with aggr(short_circuited), with different 
short-circuited expr
   query TRR
   select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, 
sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']), sum((case 
when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select 
struct(time,load1,load2,host) from t1) t2 where 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
   ----
   host1 1.1 101
   host2 2.2 202
   host3 3.3 303
   
   # can have 2 projections with aggr(short_circuited), with the same 
short-circuited expr (e.g. CASE WHEN)
   statement error DataFusion error: Optimizer rule 
'common_sub_expression_eliminate' failed
   select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, 
sum((case when t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null 
then t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c1']), sum((case when 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] is not null then 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)" end)['c2']) from (select 
struct(time,load1,load2,host) from t1) t2 where 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
   
   # can have 2 projections with aggr(short_circuited), with the same 
short-circuited expr (e.g. coalesce)
   statement error DataFusion error: Optimizer rule 
'common_sub_expression_eliminate' failed
   select t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] as host, 
sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c1']), 
sum(coalesce(t2."struct(t1.time,t1.load1,t1.load2,t1.host)")['c2']) from 
(select struct(time,load1,load2,host) from t1) t2 where 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] IS NOT NULL group by 
t2."struct(t1.time,t1.load1,t1.load2,t1.host)"['c3'] order by host;
   ```
   
   
   ### Expected behavior
   
   Should succeed on all of the provided test cases. Should not lookup with 
wrong expr, and insert the wrong expression, during the 
common-expression-elimination re-writer.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to