[PR] refactor numeric primitive aggregators in sql compatible mode (druid)

via GitHub Mon, 11 Mar 2024 18:25:54 -0700


clintropolis opened a new pull request, #10666:
URL: https://github.com/apache/druid/pull/10666


   ### Description
   
   #10219 added `hasNulls` to `ColumnCapabilities`, which makes it possible to 
tell whether a column contains any null values in SQL compatible mode 
(`druid.generic.useDefaultValueForNull=false`). With this information 
available, `NullableNumericAggregatorFactory` can use a version of the 
null-aware wrapper aggregators that does not need to check `isNull` on each 
aggregate call. This should reduce the overhead of enabling this mode for 
columns which do not have null values (I haven't measured this yet, so I'm 
unsure how large the difference is).
   
   To achieve this, I have pulled most of the logic of 
`NullableNumericAggregator`, `NullableNumericBufferAggregator`, and 
`NullableNumericVectorAggregator` out into a new set of abstract classes, 
`NullAwareNumericAggregator`, `NullAwareNumericBufferAggregator`, and 
`NullAwareNumericVectorAggregator` respectively, which the former now extend. 
For columns which do not have null values, a new set of 'non-null' aggregator 
wrappers has been introduced: `NonnullNumericAggregator`, 
`NonnullNumericBufferAggregator`, and `NonnullNumericVectorAggregator`. These 
also extend the 'null-aware' base classes, so they still initialize to a null 
value and remain compatible with how aggregators are expected to behave under 
filtering (callers reasonably expect an aggregator 'get' to produce the correct 
result even if no values were aggregated), but they can skip the 'is null' check.
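
   A minimal sketch of the difference between the two wrapper families, 
assuming heavily simplified interfaces. Every name below 
(`NullableLongSelector`, `LongAggregator`, `ArrayLongSelector`, 
`LongSumSketch`, `NullAwareSketch`, `NullableSketch`, `NonnullSketch`) is 
hypothetical and is not the actual Druid class or signature:

```java
// Hypothetical, simplified stand-ins for Druid's selector and aggregator
// interfaces; the real classes have more methods and different signatures.
interface NullableLongSelector {
  boolean isNull();
  long getLong();
}

interface LongAggregator {
  void aggregate();
  long getLong();
}

// Array-backed selector so the sketch can be exercised without a real column.
final class ArrayLongSelector implements NullableLongSelector {
  private final Long[] values;
  private int pos = 0;

  ArrayLongSelector(Long... values) { this.values = values; }

  boolean isDone() { return pos >= values.length; }
  void advance() { pos++; }

  @Override public boolean isNull() { return values[pos] == null; }
  @Override public long getLong() { return values[pos] == null ? 0L : values[pos]; }
}

// Delegate that sums whatever the selector currently points at.
final class LongSumSketch implements LongAggregator {
  private final NullableLongSelector selector;
  private long sum = 0L;

  LongSumSketch(NullableLongSelector selector) { this.selector = selector; }

  @Override public void aggregate() { sum += selector.getLong(); }
  @Override public long getLong() { return sum; }
}

// Shared base: the result starts out null and becomes non-null once a value
// has been aggregated, so 'get' is correct even if a filter matched nothing.
abstract class NullAwareSketch {
  protected final LongAggregator delegate;
  protected final NullableLongSelector selector;
  protected boolean isNull = true;

  NullAwareSketch(LongAggregator delegate, NullableLongSelector selector) {
    this.delegate = delegate;
    this.selector = selector;
  }

  abstract void aggregate();

  Long get() { return isNull ? null : delegate.getLong(); }
}

// For columns that may contain nulls: checks the selector on every call.
final class NullableSketch extends NullAwareSketch {
  NullableSketch(LongAggregator delegate, NullableLongSelector selector) {
    super(delegate, selector);
  }

  @Override void aggregate() {
    if (!selector.isNull()) {
      isNull = false;
      delegate.aggregate();
    }
  }
}

// For columns known to have no nulls: skips the per-row null check.
final class NonnullSketch extends NullAwareSketch {
  NonnullSketch(LongAggregator delegate, NullableLongSelector selector) {
    super(delegate, selector);
  }

  @Override void aggregate() {
    isNull = false;
    delegate.aggregate();
  }
}
```

   In this sketch both wrappers return null from `get` until `aggregate` has 
been called at least once; only `NullableSketch` pays for a selector `isNull` 
call per row.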
   
   `NullableNumericAggregatorFactory` (which should probably be renamed 
`NullAwareNumericAggregatorFactory`, but it's marked `@ExtensionPoint` so I 
tried not to mess with it too much) has also been expanded to include a new 
method to check if the aggregator input has null values, with a default 
implementation of:
   ```
     /**
      * Returns true if the aggregator will actually produce null values given its input selectors, e.g. if
      * the inputs to the aggregator have any nulls.
      */
     protected boolean hasNulls(ColumnInspector inspector)
     {
       return sqlCompatible;
     }
   ```
   
   with overrides in the 'simple' aggregator factories. Since there was a lot 
of duplicated code between the 'simple' aggregator factories 
(`SimpleLongAggregatorFactory`, `SimpleFloatAggregatorFactory`, 
`SimpleDoubleAggregatorFactory`), I have introduced yet another base class, 
`SimpleNumericAggregatorFactory`, to consolidate all of the 
'fieldName'/'expression' handling.
   
   ```
   public abstract class SimpleNumericAggregatorFactory<TValueSelector extends BaseNullableColumnValueSelector>
       extends NullableNumericAggregatorFactory<ColumnValueSelector>
   ```
   which also handles the 'hasNulls' check override of 
`NullableNumericAggregatorFactory`. This class could probably be consolidated 
with `NullableNumericAggregatorFactory`, but since that class is 
`@ExtensionPoint`, I avoided making this change at this time.
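
   A rough sketch of how a 'simple' factory might override the null check and 
how the result could drive the choice of wrapper. All types here 
(`CapabilitiesSketch`, `InspectorSketch`, `WrapperKind`, 
`NullableFactorySketch`, `SimpleFactorySketch`) are hypothetical 
simplifications, and the handling of expression inputs, unknown columns, and 
default-value mode is an assumption rather than what this PR necessarily does:

```java
// Hypothetical stand-ins; the real ColumnInspector/ColumnCapabilities APIs differ.
interface CapabilitiesSketch {
  boolean mayHaveNulls();
}

interface InspectorSketch {
  // Returns null when nothing is known about the column.
  CapabilitiesSketch getCapabilities(String column);
}

enum WrapperKind { NONE, NULLABLE, NONNULL }

abstract class NullableFactorySketch {
  protected final boolean sqlCompatible;

  NullableFactorySketch(boolean sqlCompatible) {
    this.sqlCompatible = sqlCompatible;
  }

  // Default mirrors the snippet in the description: assume nulls whenever
  // SQL compatible mode is on.
  protected boolean hasNulls(InspectorSketch inspector) {
    return sqlCompatible;
  }

  WrapperKind chooseWrapper(InspectorSketch inspector) {
    if (!sqlCompatible) {
      // Assumption: no null-aware wrapper is used in default-value mode.
      return WrapperKind.NONE;
    }
    return hasNulls(inspector) ? WrapperKind.NULLABLE : WrapperKind.NONNULL;
  }
}

final class SimpleFactorySketch extends NullableFactorySketch {
  // Null when the input is an expression rather than a plain column.
  private final String fieldName;

  SimpleFactorySketch(boolean sqlCompatible, String fieldName) {
    super(sqlCompatible);
    this.fieldName = fieldName;
  }

  @Override
  protected boolean hasNulls(InspectorSketch inspector) {
    if (fieldName == null) {
      // Assumption: expression inputs fall back to the conservative default.
      return super.hasNulls(inspector);
    }
    CapabilitiesSketch capabilities = inspector.getCapabilities(fieldName);
    // Assumption: an unknown column is treated as possibly containing nulls.
    return capabilities == null || capabilities.mayHaveNulls();
  }
}
```

   The conservative fallbacks mean the sketch only picks the 'non-null' 
wrapper when the column is positively known to have no nulls.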
   
   I'll try to find some time to measure the difference, but naively it seems 
obvious that using the 'non-null' family of aggregators should be better, since 
it has fewer method calls and fewer branches.
   
   There is a potential further optimization: callers which "know" that there 
are values to aggregate (e.g. no filter) could avoid the extra byte of overhead 
used for null tracking on columns which don't have any null values. This would 
require modifying the 'factorize' methods to let callers communicate this 
information (aggregators don't know such things). I haven't made this 
modification here since this PR was already starting to get a bit big, so it 
can be done as a follow-up.
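
   To illustrate the extra byte in question, here is a sketch of a buffer 
layout with and without null tracking. The class names, flag values, and layout 
(`NullTrackedLongSumSketch`, `UntrackedLongSumSketch`, `IS_NULL`, `NOT_NULL`) 
are assumptions made for illustration and are not taken from the Druid code:

```java
import java.nio.ByteBuffer;

// Hypothetical layout: one flag byte followed by an 8-byte running sum.
final class NullTrackedLongSumSketch {
  static final byte IS_NULL = 1;
  static final byte NOT_NULL = 0;
  static final int SIZE_BYTES = Byte.BYTES + Long.BYTES;

  static void init(ByteBuffer buf, int position) {
    buf.put(position, IS_NULL);
    buf.putLong(position + Byte.BYTES, 0L);
  }

  static void aggregate(ByteBuffer buf, int position, long value) {
    buf.put(position, NOT_NULL);
    int valuePosition = position + Byte.BYTES;
    buf.putLong(valuePosition, buf.getLong(valuePosition) + value);
  }

  // Returns null if nothing was aggregated into this slot.
  static Long get(ByteBuffer buf, int position) {
    if (buf.get(position) == IS_NULL) {
      return null;
    }
    return buf.getLong(position + Byte.BYTES);
  }
}

// Hypothetical layout without the flag byte: only safe when the caller
// knows at least one non-null value will be aggregated into every slot.
final class UntrackedLongSumSketch {
  static final int SIZE_BYTES = Long.BYTES;

  static void init(ByteBuffer buf, int position) {
    buf.putLong(position, 0L);
  }

  static void aggregate(ByteBuffer buf, int position, long value) {
    buf.putLong(position, buf.getLong(position) + value);
  }

  static long get(ByteBuffer buf, int position) {
    return buf.getLong(position);
  }
}
```

   In the tracked sketch a slot that never receives a value reads back as 
null, while the untracked one would read back as 0, which is why the caller 
would need to guarantee that values exist before the flag byte could be dropped.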
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `NullableNumericAggregatorFactory`
    * `NullAwareNumericAggregator`
    * `NullAwareNumericBufferAggregator`
    * `NullAwareNumericVectorAggregator`
    * `NullableNumericAggregator`
    * `NullableNumericBufferAggregator`
    * `NullableNumericVectorAggregator`
    * `NonnullNumericAggregator`
    * `NonnullNumericBufferAggregator`
    * `NonnullNumericVectorAggregator`
    * `SimpleNumericAggregatorFactory`
    * `SimpleLongAggregatorFactory`
    * `SimpleFloatAggregatorFactory`
    * `SimpleDoubleAggregatorFactory`
   

