[GitHub] [druid] clintropolis opened a new pull request #11188: SQL timeseries no longer skip empty buckets with all granularity

GitBox Sun, 02 May 2021 16:54:16 -0700


clintropolis opened a new pull request #11188:
URL: https://github.com/apache/druid/pull/11188



   ### Description
   This PR modifies the Druid SQL planner to no longer skip empty buckets when 
it generates a timeseries query with 'all' granularity, which should produce 
query result behavior that is more SQL compatible.
   
   In SQL when running a query with only aggregators, such as `SELECT COUNT(*), 
SUM(x) FROM y WHERE z = 'someval'`, then the aggregation is done as if there is 
a single universal grouping key, and so this example query should return 
something even if there are no values which match `'someval'`, and should 
produce `[0, null]` in this case. 
   
   We plan queries like this example into a Druid native timeseries query, and 
prior to this patch would always set `skipEmptyBuckets` query context parameter 
to 'true', because a timeseries query with a granularity is equivalent to 
grouping by a time floor expression, so empty results should be removed in this 
case. But when using 'ALL' granularity, such as we do in the example query when 
there is no time floor expression to be grouping on, this setting isn't 
appropriate, because it makes druid return empty results, instead of the 
initialized aggregators for the universal group like it correctly should.
   
   I went through and added tests for I think almost all of our SQL aggregator 
functions to capture their current outputs in both default mode and SQL 
compatible null handling mode for this empty timeseries result case, and got a 
bit carried away refactoring these tests to be simplified and extend 
`BaseCalciteQueryTest` which is why the PR is sort of large (but is a nice 
improvement for all of these SQL tests).
   
   I'm not actually sure all of these aggregator functions behave correctly, 
notably some of the quantiles sketches which return `Double.NaN` in both modes, 
and some other sketches which return an empty sketch in both modes, but I'll 
save changing the behavior of any of these aggregators for another day.
   
   Since this changes the default behavior of query results in that SQL queries 
which plan to timeseries queries with ALL granularity will now always produce a 
single result row instead of nothing. The old behavior can be retained by 
explicitly setting `skipEmptyBuckets` to true, which will overrule the planner 
behavior.
   
   ##### Key changed/added classes in this PR
   * `DruidQuery.java`
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] clintropolis opened a new pull request #11188: SQL timeseries no longer skip empty buckets with all granularity

Reply via email to