[GitHub] [druid] clintropolis opened a new pull request #11157: ARRAY_AGG sql aggregator function

GitBox Fri, 23 Apr 2021 02:32:56 -0700


clintropolis opened a new pull request #11157:
URL: https://github.com/apache/druid/pull/11157



   ### Description
   This PR adds `ARRAY_AGG` to SQL, building it on top of the expression 
aggregator provided in #11104.
   
   | function | description |
   | --- | --- |
   |`ARRAY_AGG(expr)`|Collects all values of `expr` into an ARRAY, including 
null values, with the default limit on aggregation size of 1024 bytes. `ORDER 
BY` on the `ARRAY_AGG` expression is not currently supported.|
   |`ARRAY_AGG(DISTINCT expr)`|Collects all distinct values of `expr` into an 
ARRAY, including null values, with the default limit on aggregation size of 
1024 bytes per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not 
currently supported.|
   |`ARRAY_AGG(expr, maxSizeBytes)`|Collects all values of `expr` into an 
ARRAY, including null values, with specified maximum byte size per aggregate. 
`ORDER BY` on the `ARRAY_AGG` expression is not currently supported.|
   |`ARRAY_AGG(DISTINCT expr, maxSizeBytes)`|Collects all distinct values of 
`expr` into an ARRAY, including null values, with specified maximum byte size 
per aggregate. `ORDER BY` on the `ARRAY_AGG` expression is not currently 
supported.|
   
    Since this is an expression aggregator, it is unlikely to be the most 
optimal way to provide this functionality, But, it importantly makes it so that 
we _can_ offer this functionality relatively easily, since arrays are more or 
less only supported in the expression system at this point, so it is still an 
improvement.
   
   Until we update Apache Calcite to a newer version, we are unable to support 
the `ORDER BY` that some other databases which offer `ARRAY_AGG` have, due to 
https://issues.apache.org/jira/browse/CALCITE-4335. Upgrading Calcite to any 
newer version is tricky because of requirements of a newer Guava dependency.
   
   We also will accept a non-standard 2 argument form of `ARRAY_AGG` as 
specified above, so that we can specify the maximum buffer aggregator size in 
bytes, which will be necessary until we have growable buffer aggregators.
   
   I added a lot of SQL query tests to try and stress some areas of array 
handling in addition to testing this aggregator itself, including expected 
failure tests such as trying to group on the output of an array agg output of a 
subquery or join.
   
   It was admittedly rather pleasing to see stuff like
   
   ```sql
   SELECT dim2, COUNT(*)
   FROM foo
   WHERE ARRAY_CONTAINS((SELECT ARRAY_AGG(DISTINCT dim1) FROM foo WHERE dim1 is 
not null), dim1)
   GROUP BY 1
   ```
   able to just work out of the box after I added the aggregator function, so I 
consider array support overall to be progressing pretty smoothly.
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] clintropolis opened a new pull request #11157: ARRAY_AGG sql aggregator function

Reply via email to