[GitHub] [incubator-druid] clintropolis opened a new pull request #8822: optimize numeric column null value checking for low filter selectivity (more rows)

GitBox Mon, 04 Nov 2019 17:35:37 -0800

clintropolis opened a new pull request #8822: optimize numeric column null 
value checking for low filter selectivity (more rows)
URL: https://github.com/apache/incubator-druid/pull/8822
 
 
   ### Description
   
   I would like to add documentation for Druid SQL compatible null handling 
added in #4349 to the website, in part because it has been merged for quite a 
while now, but also so that we can feel better about the changes #8566 causes 
in SQL behavior in Druids native mode (tl;dr some calcite optimizations lead to 
some query incompatibilities specifically around our treatment of `''` and 
`null` as equivalent).
   
   Part of like.. being responsible and stuff, before adding this documentation 
I first wanted to collect some data to determine if it _was a good idea to in 
fact document it at this time_. The place I was specifically worried about was 
the `isNull` check, so I added a benchmark, 
`NullHandlingBitmapGetVsIteratorBenchmark`,  and ran some experiments to see 
what to expect as well as if we could do better.
   
   The nature of selectors lends itself to using a bitmap iterator as an 
alternative to calling bitmap.get for every `isNull` check, so I first tested 
`get` vs `iterator` with various null bitmap densities and filter selectivities 
to simulate the overhead of having null value handling selector with each 
approach. I have so far only collected information on roaring, because with 
`concise` the `get` method would not complete at low selectivity, so I'm 
assuming it's numbers are quite bad (and more on this later).
   
   To help make sense of the numbers collected from this, I created heatmaps to 
compare the differences between the approaches. With the raw output, these 
looked something like this which isn't so telling:
   
   
![raw](https://user-images.githubusercontent.com/1577461/68170194-b52bde00-ff23-11e9-9e84-2047cc67df72.gif)
   
   Interpolating the data to fill in the gaps yields something a bit prettier, 
but still not incredibly telling on it's own:
   
![difference-iterator-get](https://user-images.githubusercontent.com/1577461/68170242-dbea1480-ff23-11e9-923e-67872816a1a3.gif)
   But does start to offer that the iterator is better at dealing with a larger 
number of rows, diminishing as the density of the bitmap increases.
   
   Scaling the data points to estimate the _cost per row_ in nanoseconds 
provides a much more useful picture, and shows the abysmal performance of using 
the iterator at super high selectivity (e.g. 0.1% of rows match and are 
selected) with really dense bitmaps:
   
![difference-per-row-iterator-get](https://user-images.githubusercontent.com/1577461/68170352-426f3280-ff24-11e9-9782-7a47892f0f18.gif)
   However if we truncate the results and compare areas where the `iterator` is 
better:
   
![difference-per-row-iterator-better](https://user-images.githubusercontent.com/1577461/68170371-53b83f00-ff24-11e9-989e-41ec0219d29c.gif)
   to where `get` is better
   
![difference-per-row-get-better](https://user-images.githubusercontent.com/1577461/68170380-5b77e380-ff24-11e9-8f4f-538004046187.gif)
   it does look like the iterator does perform up to 15 `ns per row` better 
than using `get` when processing a lot of rows.
   
   Searching for a way around the limitation, I was unaware that the 
`IntIterator` for roaring was actually a `PeekableIntIterator` which is an 
iterator that offers an `advanceIfNeeded` method that allows skipping the 
iterator ahead to an index. `ConciseSet` actually has a similar method on it's 
iterator, `skipAllBefore`, which _is used by_ the `contains` method that the 
`get` of concise uses! This is likely why concise just flat out wouldn't 
complete the benchmarks when processing a higher number of rows, because every 
`get` is creating an iterator, skipping to the position, loop checking to see 
if the iterator contains the index or passes it, and then throwing it away.
   
   Adding the 'peekable' iterator to the benchmark had a similar outcome to 
using the plain iterator, 
   
   
![difference-peekable-iterator-get](https://user-images.githubusercontent.com/1577461/68170721-c8d84400-ff25-11e9-9c10-47d5ebf40933.gif)
   
   but without nearly the overhead at high selectivity on dense bitmaps. 
   
![difference-per-row-peekable-iterator-get](https://user-images.githubusercontent.com/1577461/68170735-d392d900-ff25-11e9-8476-7bb2728fe43b.gif)
   It's still worse, but not nearly as bad, no more than 60ns slower per row 
when processing a small number of rows than get, compared to over 2 
microseconds for using the plain iterator.
   
   Peekable iterator better:
   
   
![difference-per-row-peekable-iterator-better](https://user-images.githubusercontent.com/1577461/68170780-f91fe280-ff25-11e9-86b8-6ff026d4813e.gif)
   
   Get better than peekable iterator:
   
![difference-per-row-peekable-iterator-get-better](https://user-images.githubusercontent.com/1577461/68170798-01781d80-ff26-11e9-9daf-ca95788bdb71.gif)
   
   Finally, to get a handle on the per row cost in nanoseconds of checking for 
nulls at all, I threw together these animations:
   
   get:
   
![get-per-row-breakdown](https://user-images.githubusercontent.com/1577461/68171443-a3006e80-ff28-11e9-87be-3894c0978e73.gif)
   
   iterator:
   
![iterator-per-row-breakdown](https://user-images.githubusercontent.com/1577461/68171452-a85db900-ff28-11e9-817a-833b1f64f823.gif)
   
   peekable iterator:
   
![peekable-iterator-per-row-breakdown](https://user-images.githubusercontent.com/1577461/68171456-ae539a00-ff28-11e9-999b-241ee39de895.gif)
   
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added unit tests or modified existing tests to cover new code paths.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `ImmutableBitmap`
    * `OurBar`
    * `TheirBaz`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org

[GitHub] [incubator-druid] clintropolis opened a new pull request #8822: optimize numeric column null value checking for low filter selectivity (more rows)

Reply via email to