james727 opened a new issue #1512:
URL: https://github.com/apache/arrow-datafusion/issues/1512
**Describe the bug**
It seems like distinct aggregations (e.g. `SELECT SUM(DISTINCT <expr>) ...`,
`SELECT ARRAY_AGG(DISTINCT <expr>)...`) are only partially supported by
DataFusion. In specific circumstances they work as expected, and in other cases
they throw errors.
Here's an example of a simple query that works as intended - to repro, add
the following test case into `datafusion/tests/sql/aggregates.rs`:
```rust
#[tokio::test]
async fn csv_query_array_agg_distinct() -> Result<()> {
let mut ctx = ExecutionContext::new();
register_aggregate_csv(&mut ctx).await?;
let sql =
"SELECT array_agg(DISTINCT c2) FROM aggregate_test_100";
let actual = execute_to_batches(&mut ctx, sql).await;
let expected = vec![
"+-----------------------------------------------+",
"| ARRAYAGG(DISTINCT aggregate_test_100.c2) |",
"+-----------------------------------------------+",
"| [4, 2, 3, 5, 1]
|",
"+-----------------------------------------------+",
];
assert_batches_eq!(expected, &actual);
Ok(())
}
```
This works (well - it fails due to nondeterministic ordering, but it works
as intended).
Here are some queries that fail:
- `SELECT array_agg(DISTINCT 2 * c2) FROM aggregate_test_100` - Fails with
`Plan("No field named 'aggregate_test_100.c2'. Valid fields are 'Int64(5) *
aggregate_test_100.c2'.")`
- `SELECT array_agg(DISTINCT c2), sum(distinct 5*c2) c3 FROM
aggregate_test_100` - Fails with `NotImplemented("SUM(DISTINCT) aggregations
are not available")`
I've narrowed the issue down to the optimization logic in
`single_distinct_to_groupby.rs` - in fact, if I disable this optimization rule,
even the basic cases don't work correctly. This suggests that this may be
working accidentally?
**To Reproduce**
Repro steps above.
**Expected behavior**
Either this should be disabled entirely (similar to the `NotImplemented`
error message for certain queries involving sum), or the queries above should
be fixed.
**Additional context**
I found this when working through
https://github.com/apache/arrow-datafusion/issues/1323 - I may be able to take
a pass at fixing this, but it'd be helpful to get thoughts from people more
familiar with the project on the appropriate path forward here first.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]