[GitHub] [druid] clintropolis opened a new pull request, #12627: split out null value index

GitBox Fri, 10 Jun 2022 02:30:31 -0700


clintropolis opened a new pull request, #12627:
URL: https://github.com/apache/druid/pull/12627


   ### Description
   This PR splits the `NULL` value index out into its own construct for filter 
processing, for two reasons. The first is that the new index structure 
introduced in #12388 means that we can now provide basically any index we can 
imagine, and by having a separate `NullValueIndex` for `ColumnIndexSupplier` 
to provide to `Filter`, we can provide limited indexes for existing Druid 
numeric columns (if `druid.generic.useDefaultValueForNull=false`). This 
improves performance quite nicely when using numeric columns in 
`IS NULL`/`IS NOT NULL` style queries.
   
   ```
         // 42,43: filter numeric nulls
         "SELECT SUM(long5) FROM foo WHERE long5 IS NOT NULL",
         "SELECT string2, SUM(long5) FROM foo WHERE long5 IS NOT NULL GROUP BY 1"
   ```
   before:
   ```
   Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score    Error  Units
   SqlExpressionBenchmark.querySql       42           5000000        false  avgt    5  107.308 ±  3.975  ms/op
   SqlExpressionBenchmark.querySql       42           5000000        force  avgt    5  118.790 ±  1.792  ms/op
   SqlExpressionBenchmark.querySql       43           5000000        false  avgt    5  259.829 ± 12.465  ms/op
   SqlExpressionBenchmark.querySql       43           5000000        force  avgt    5  241.353 ± 10.005  ms/op
   ```
   after:
   ```
   Benchmark                        (query)  (rowsPerSegment)  (vectorize)  Mode  Cnt    Score   Error  Units
   SqlExpressionBenchmark.querySql       42           5000000        false  avgt    5   71.755 ± 3.011  ms/op
   SqlExpressionBenchmark.querySql       42           5000000        force  avgt    5   55.674 ± 1.288  ms/op
   SqlExpressionBenchmark.querySql       43           5000000        false  avgt    5  241.125 ± 5.544  ms/op
   SqlExpressionBenchmark.querySql       43           5000000        force  avgt    5  176.183 ± 7.409  ms/op
   ```
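
For readers unfamiliar with the idea, here is a rough sketch of why a dedicated null bitmap helps `IS NULL`/`IS NOT NULL` on a numeric column. It is not the code in this PR: the class and method names are hypothetical, and `java.util.BitSet` stands in for Druid's bitmap types purely for illustration.

```java
import java.util.BitSet;

public class NullIndexSketch {
    // Hypothetical helper: record which rows of a nullable long column are null.
    static BitSet buildNullBitmap(Long[] column) {
        BitSet nulls = new BitSet(column.length);
        for (int row = 0; row < column.length; row++) {
            if (column[row] == null) {
                nulls.set(row);
            }
        }
        return nulls;
    }

    // IS NULL is answered directly from the null bitmap, without reading values.
    static BitSet isNull(BitSet nulls) {
        return (BitSet) nulls.clone();
    }

    // IS NOT NULL is the complement of the null bitmap over the row range.
    static BitSet isNotNull(BitSet nulls, int numRows) {
        BitSet result = (BitSet) nulls.clone();
        result.flip(0, numRows);
        return result;
    }

    // Sum only the rows selected by the bitmap, skipping all other rows.
    static long sumSelected(Long[] column, BitSet rows) {
        long sum = 0;
        for (int row = rows.nextSetBit(0); row >= 0; row = rows.nextSetBit(row + 1)) {
            sum += column[row];
        }
        return sum;
    }

    public static void main(String[] args) {
        Long[] long5 = {10L, null, 7L, null, 3L};
        BitSet nulls = buildNullBitmap(long5);
        BitSet notNull = isNotNull(nulls, long5.length);
        System.out.println(notNull);                      // {0, 2, 4}
        System.out.println(sumSelected(long5, notNull));  // 20
    }
}
```

With the bitmap available up front, the filter can be resolved before scanning, so only matching rows need to be read at all.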
   
   The second reason to split null value indexes into their own thing is to 
set the stage for allowing our filter behavior to be SQL compliant. With this 
explicit null value index in place, if we modify the other index providers to 
never match null, we should make it a lot harder for filters to accidentally 
match nulls using indexes, and could allow the cursor builder to use implicit 
'is not null' indexes for columns which are not explicitly being matched for 
'is null'.
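
To illustrate the direction described above, here is a minimal sketch, under assumed semantics, of a value index that never matches null combined with a separate null bitmap. The class and its methods are hypothetical and are not part of this PR or of Druid's API. The point is that a negated filter built by complementing a value bitmap would accidentally include null rows unless the null bitmap is subtracted.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class SqlNullSemanticsSketch {
    // Value index built over non-null values only, so it can never match null.
    final Map<String, BitSet> valueIndex = new HashMap<>();
    // Null rows are tracked separately, as an explicit null value index.
    final BitSet nullRows = new BitSet();
    final int numRows;

    SqlNullSemanticsSketch(String[] column) {
        numRows = column.length;
        for (int row = 0; row < numRows; row++) {
            if (column[row] == null) {
                nullRows.set(row);
            } else {
                valueIndex.computeIfAbsent(column[row], v -> new BitSet()).set(row);
            }
        }
    }

    // column = value: only rows holding exactly that non-null value.
    BitSet equalTo(String value) {
        BitSet rows = valueIndex.get(value);
        return rows == null ? new BitSet() : (BitSet) rows.clone();
    }

    // column <> value under SQL semantics: complement, then drop null rows
    // (an implicit "is not null").
    BitSet notEqualTo(String value) {
        BitSet result = equalTo(value);
        result.flip(0, numRows);
        result.andNot(nullRows);
        return result;
    }

    public static void main(String[] args) {
        String[] column = {"a", null, "b", "a", null};
        SqlNullSemanticsSketch index = new SqlNullSemanticsSketch(column);
        System.out.println(index.equalTo("a"));     // {0, 3}
        System.out.println(index.notEqualTo("a"));  // {2}
    }
}
```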
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

