clintropolis opened a new pull request #11188:
URL: https://github.com/apache/druid/pull/11188
### Description
This PR modifies the Druid SQL planner to no longer skip empty buckets when
it generates a timeseries query with 'all' granularity, which should produce
query result behavior that is more SQL compatible.
In SQL when running a query with only aggregators, such as `SELECT COUNT(*),
SUM(x) FROM y WHERE z = 'someval'`, then the aggregation is done as if there is
a single universal grouping key, and so this example query should return
something even if there are no values which match `'someval'`, and should
produce `[0, null]` in this case.
We plan queries like this example into a Druid native timeseries query, and
prior to this patch would always set `skipEmptyBuckets` query context parameter
to 'true', because a timeseries query with a granularity is equivalent to
grouping by a time floor expression, so empty results should be removed in this
case. But when using 'ALL' granularity, such as we do in the example query when
there is no time floor expression to be grouping on, this setting isn't
appropriate, because it makes druid return empty results, instead of the
initialized aggregators for the universal group like it correctly should.
I went through and added tests for I think almost all of our SQL aggregator
functions to capture their current outputs in both default mode and SQL
compatible null handling mode for this empty timeseries result case, and got a
bit carried away refactoring these tests to be simplified and extend
`BaseCalciteQueryTest` which is why the PR is sort of large (but is a nice
improvement for all of these SQL tests).
I'm not actually sure all of these aggregator functions behave correctly,
notably some of the quantiles sketches which return `Double.NaN` in both modes,
and some other sketches which return an empty sketch in both modes, but I'll
save changing the behavior of any of these aggregators for another day.
Since this changes the default behavior of query results in that SQL queries
which plan to timeseries queries with ALL granularity will now always produce a
single result row instead of nothing. The old behavior can be retained by
explicitly setting `skipEmptyBuckets` to true, which will overrule the planner
behavior.
##### Key changed/added classes in this PR
* `DruidQuery.java`
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]