clintropolis opened a new pull request #11157:
URL: https://github.com/apache/druid/pull/11157
### Description
This PR adds `ARRAY_AGG` to SQL, building it on top of the expression
aggregator provided in #11104.
| function | description |
| --- | --- |
|`ARRAY_AGG(expr)`|Collects all values of `expr` into an ARRAY, including
null values, with the default limit on aggregation size of 1024 bytes. `ORDER
BY` on the `ARRAY_AGG` expression is not currently supported.|
|`ARRAY_AGG(DISTINCT expr)`|Collects all distinct values of `expr` into an
ARRAY, including null values, with the default limit on aggregation size of
1024 bytes per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not
currently supported.|
|`ARRAY_AGG(expr, maxSizeBytes)`|Collects all values of `expr` into an
ARRAY, including null values, with specified maximum byte size per aggregate.
`ORDER BY` on the `ARRAY_AGG` expression is not currently supported.|
|`ARRAY_AGG(DISTINCT expr, maxSizeBytes)`|Collects all distinct values of
`expr` into an ARRAY, including null values, with specified maximum byte size
per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not currently
supported.|
Since this is an expression aggregator, it is unlikely to be the most
optimal way to provide this functionality, But, it importantly makes it so that
we _can_ offer this functionality relatively easily, since arrays are more or
less only supported in the expression system at this point, so it is still an
improvement.
Until we update Apache Calcite to a newer version, we are unable to support
the `ORDER BY` that some other databases which offer `ARRAY_AGG` have, due to
https://issues.apache.org/jira/browse/CALCITE-4335. Upgrading Calcite to any
newer version is tricky because of requirements of a newer Guava dependency.
We also will accept a non-standard 2 argument form of `ARRAY_AGG` as
specified above, so that we can specify the maximum buffer aggregator size in
bytes, which will be necessary until we have growable buffer aggregators.
I added a lot of SQL query tests to try and stress some areas of array
handling in addition to testing this aggregator itself, including expected
failure tests such as trying to group on the output of an array agg output of a
subquery or join.
It was admittedly rather pleasing to see stuff like
```sql
SELECT dim2, COUNT(*)
FROM foo
WHERE ARRAY_CONTAINS((SELECT ARRAY_AGG(DISTINCT dim1) FROM foo WHERE dim1 is
not null), dim1)
GROUP BY 1
```
able to just work out of the box after I added the aggregator function, so I
consider array support overall to be progressing pretty smoothly.
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]