waynexia commented on issue #2994:
URL:
https://github.com/apache/arrow-datafusion/issues/2994#issuecomment-1202253162
Thanks for your reproducer @jiacai2050. Here are some investigations:
Looks like the `single_distinct_to_groupby` optimizer doesn't consider the
original `GROUP BY` contains expression like `a+1` or `b%2`. It copies `GROUP
BY`s to the inner subquery, which may cause two problems:
- Schema doesn't match. The schema of subquery will become something like
`#table.a + Int64(1)`. And thus the outer `GROUP BY` cannot get a column `a`
from it.
- Duplicated evaluation. Expressions in `GROUP BY` cannot be calculated
twice.
To fix this, we need to alias the "original" `GROUP BY` in the subquery and
refer to that aliased column in outer. E.g.:
- Before optimize
```sql
SELECT id+1, count(distinct(bank_account))
FROM users
GROUP BY id+1;
```
- After optimize
```sql
SELECT group_by_alias1 as "id+1", count(bank_account)
FROM (
SELECT id+1 as group_by_alias1, bank_account
FROM users
GROUP BY id+1, bank_account
)
GROUP BY group_by_alias1;
```
I'll submit a fix around this or next week.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]