rohangarg opened a new pull request, #13133:
URL: https://github.com/apache/druid/pull/13133
Currently, for checking the in-filter values in a column dictionary we use a
binary search per value in the set. That works well for smaller value-sets but
starts slowing down as the number of values in the set increase. To accommodate
for large value-sets arising from large in-filters or from joins being pushed
down as in-filters, we use sorted merge algorithm for merging the set and
dictionary for larger values.
The following benchmark was run to find the cutoff point :
```
Benchmark (dictionarySize) (filterToDictionaryPercentage)
(selectivityPercentage) Mode Cnt Score Error Units
binarySearch 1000000 1
10 avgt 10 11271.575 ± 621.620 us/op
binarySearch 1000000 1
100 avgt 10 16751.127 ± 580.550 us/op
binarySearch 1000000 2
10 avgt 10 25387.245 ± 1795.582 us/op
binarySearch 1000000 2
100 avgt 10 19707.334 ± 973.384 us/op
binarySearch 1000000 5
10 avgt 10 37256.779 ± 2287.203 us/op
binarySearch 1000000 5
100 avgt 10 47391.511 ± 1689.971 us/op
binarySearch 1000000 10
10 avgt 10 76204.615 ± 6515.056 us/op
binarySearch 1000000 10
100 avgt 10 71483.416 ± 7197.376 us/op
binarySearch 1000000 12
10 avgt 10 139481.091 ± 15513.357 us/op
binarySearch 1000000 12
100 avgt 10 142881.846 ± 9511.367 us/op
binarySearch 1000000 15
10 avgt 10 113786.273 ± 4123.158 us/op
binarySearch 1000000 15
100 avgt 10 165300.278 ± 10555.479 us/op
binarySearch 1000000 20
10 avgt 10 138410.942 ± 14330.367 us/op
binarySearch 1000000 20
100 avgt 10 137543.621 ± 9273.845 us/op
binarySearch 1000000 30
10 avgt 10 206512.608 ± 13497.954 us/op
binarySearch 1000000 30
100 avgt 10 305317.908 ± 12452.686 us/op
binarySearch 1000000 50
10 avgt 10 328867.893 ± 14594.249 us/op
binarySearch 1000000 50
100 avgt 10 332823.291 ± 26349.158 us/op
binarySearch 1000000 100
10 avgt 10 668720.906 ± 49193.818 us/op
binarySearch 1000000 100
100 avgt 10 1014546.405 ± 30709.670 us/op
sortedMerge 1000000 1
10 avgt 10 47220.743 ± 2598.755 us/op
sortedMerge 1000000 1
100 avgt 10 53634.485 ± 2886.296 us/op
sortedMerge 1000000 2
10 avgt 10 51201.356 ± 2745.801 us/op
sortedMerge 1000000 2
100 avgt 10 53046.058 ± 4500.987 us/op
sortedMerge 1000000 5
10 avgt 10 58501.742 ± 7320.870 us/op
sortedMerge 1000000 5
100 avgt 10 65597.519 ± 7356.548 us/op
sortedMerge 1000000 10
10 avgt 10 75347.468 ± 9417.556 us/op
sortedMerge 1000000 10
100 avgt 10 74601.584 ± 5251.078 us/op
sortedMerge 1000000 12
10 avgt 10 64838.734 ± 2644.738 us/op
sortedMerge 1000000 12
100 avgt 10 80342.737 ± 5414.306 us/op
sortedMerge 1000000 15
10 avgt 10 83345.836 ± 6034.488 us/op
sortedMerge 1000000 15
100 avgt 10 78405.299 ± 1651.375 us/op
sortedMerge 1000000 20
10 avgt 10 111307.577 ± 10924.456 us/op
sortedMerge 1000000 20
100 avgt 10 89371.173 ± 10347.814 us/op
sortedMerge 1000000 30
10 avgt 10 116740.355 ± 6003.945 us/op
sortedMerge 1000000 30
100 avgt 10 117312.763 ± 6325.076 us/op
sortedMerge 1000000 50
10 avgt 10 132145.411 ± 23121.259 us/op
sortedMerge 1000000 50
100 avgt 10 192802.722 ± 39032.876 us/op
sortedMerge 1000000 100
10 avgt 10 216079.634 ± 28725.333 us/op
sortedMerge 1000000 100
100 avgt 10 236561.476 ± 14369.673 us/op
```
This PR has:
- [x] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]