clintropolis opened a new pull request #12277: URL: https://github.com/apache/druid/pull/12277
Closes #3922

### Description

This PR adds a new way of storing `STRING` typed columns, using an incremental encoding strategy called 'front coding'. Essentially, sorted values are split into buckets; the first value in each bucket is written completely, and the remaining values in the bucket are stored as a pair composed of the length of the prefix shared with the first string (the prefix length) and the fragment of the value that remains after that prefix. Using the quickstart wikipedia example, this results in greatly reduced segment sizes, with a relatively small performance penalty in most typical Druid queries that grows with the number of columns involved in a query. I suspect in many cases this penalty is worth the cost, since being able to fit significantly more segments in memory avoids disk reads, depending on the typical query workloads of the cluster.

#### Segment Size and Performance

Wikipedia "quickstart" segments:

<img width="968" alt="Screen Shot 2022-02-23 at 3 19 33 AM" src="https://user-images.githubusercontent.com/1577461/155309650-ba583aed-417d-4a8c-bcda-0b00aa08b076.png">

`GenericIndexed` vs `FrontCodedIndexed` with bucket size 4 and size 16:

```
Benchmark                                        (indexType)  (numElements)  (numOperations)  (width)  Mode  Cnt        Score      Error  Units
FrontCodedIndexedBenchmark.get                       generic          10000            10000       16  avgt    5       78.819 ±    2.541  ns/op
FrontCodedIndexedBenchmark.get:encoded size          generic          10000            10000       16  avgt    5   240010.000             bytes
FrontCodedIndexedBenchmark.get                 front-coded-4          10000            10000       16  avgt    5      220.887 ±   19.267  ns/op
FrontCodedIndexedBenchmark.get:encoded size    front-coded-4          10000            10000       16  avgt    5   169894.000             bytes
FrontCodedIndexedBenchmark.get                front-coded-16          10000            10000       16  avgt    5      629.621 ±   59.667  ns/op
FrontCodedIndexedBenchmark.get:encoded size   front-coded-16          10000            10000       16  avgt    5   164428.000             bytes
FrontCodedIndexedBenchmark.indexOf                   generic          10000            10000       16  avgt    5     1228.012 ±  100.691  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size      generic          10000            10000       16  avgt    5   240010.000             bytes
FrontCodedIndexedBenchmark.indexOf             front-coded-4          10000            10000       16  avgt    5     2101.530 ±  209.465  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-4         10000            10000       16  avgt    5   169908.000             bytes
FrontCodedIndexedBenchmark.indexOf            front-coded-16          10000            10000       16  avgt    5     2364.248 ±  156.352  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-16        10000            10000       16  avgt    5   164321.000             bytes
FrontCodedIndexedBenchmark.iterator                  generic          10000            10000       16  avgt    5       80.247 ±    1.639  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size     generic          10000            10000       16  avgt    5   240010.000             bytes
FrontCodedIndexedBenchmark.iterator            front-coded-4          10000            10000       16  avgt    5      101.921 ±    6.225  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-4        10000            10000       16  avgt    5   169889.000             bytes
FrontCodedIndexedBenchmark.iterator           front-coded-16          10000            10000       16  avgt    5      105.235 ±    2.333  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-16       10000            10000       16  avgt    5   164221.000             bytes
FrontCodedIndexedBenchmark.get                       generic         100000            10000       16  avgt    5       95.947 ±    3.842  ns/op
FrontCodedIndexedBenchmark.get:encoded size          generic         100000            10000       16  avgt    5  2400010.000             bytes
FrontCodedIndexedBenchmark.get                 front-coded-4         100000            10000       16  avgt    5      239.630 ±   40.148  ns/op
FrontCodedIndexedBenchmark.get:encoded size    front-coded-4         100000            10000       16  avgt    5  1636718.000             bytes
FrontCodedIndexedBenchmark.get                front-coded-16         100000            10000       16  avgt    5      648.001 ±   28.965  ns/op
FrontCodedIndexedBenchmark.get:encoded size   front-coded-16         100000            10000       16  avgt    5  1564861.000             bytes
FrontCodedIndexedBenchmark.indexOf                   generic         100000            10000       16  avgt    5     1612.575 ±  149.556  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size      generic         100000            10000       16  avgt    5  2400010.000             bytes
FrontCodedIndexedBenchmark.indexOf             front-coded-4         100000            10000       16  avgt    5     2934.297 ±  176.182  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-4        100000            10000       16  avgt    5  1636652.000             bytes
FrontCodedIndexedBenchmark.indexOf            front-coded-16         100000            10000       16  avgt    5     3235.271 ±  438.586  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-16       100000            10000       16  avgt    5  1563940.000             bytes
FrontCodedIndexedBenchmark.iterator                  generic         100000            10000       16  avgt    5      787.253 ±   83.715  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size     generic         100000            10000       16  avgt    5  2400010.000             bytes
FrontCodedIndexedBenchmark.iterator            front-coded-4         100000            10000       16  avgt    5     1048.532 ±   29.200  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-4       100000            10000       16  avgt    5  1636546.000             bytes
FrontCodedIndexedBenchmark.iterator           front-coded-16         100000            10000       16  avgt    5     1054.068 ±   66.667  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-16      100000            10000       16  avgt    5  1565019.000             bytes
```

How this translates into queries has so far been measured with `SqlBenchmark`, though its generated data set does pretty poorly with this encoding, since it is composed of numbers translated directly into strings, leaving few prefixes to take advantage of (383MB with generic indexed compared to 381MB with front-coded indexed). This means that the size advantage is mostly absent here, so this is likely near the worst case for this encoding strategy.
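To make the bucketing scheme concrete, here is an illustrative Python sketch of bucketed front coding: sorted values are grouped into fixed-size buckets, the first value of each bucket is stored whole, and the rest are stored as (prefix length, fragment) pairs against that first value. All names and details here are mine for illustration only, not the Java implementation in this PR.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared prefix of two byte strings."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def front_code(sorted_values, bucket_size=4):
    """Encode a sorted list of byte strings into front-coded buckets."""
    buckets = []
    for start in range(0, len(sorted_values), bucket_size):
        chunk = sorted_values[start:start + bucket_size]
        first = chunk[0]
        rest = []
        for value in chunk[1:]:
            prefix = common_prefix_len(first, value)
            rest.append((prefix, value[prefix:]))  # (prefix length, fragment)
        buckets.append((first, rest))
    return buckets

def get(buckets, bucket_size, index):
    """Reconstruct the value at `index`, touching only one bucket."""
    first, rest = buckets[index // bucket_size]
    offset = index % bucket_size
    if offset == 0:
        return first
    prefix, fragment = rest[offset - 1]
    return first[:prefix] + fragment

values = sorted([b"druid", b"drum", b"drummer", b"drunk", b"dry", b"duck"])
buckets = front_code(values, bucket_size=4)
# every value round-trips exactly
assert all(get(buckets, 4, i) == v for i, v in enumerate(values))
```

Note the trade-off visible even in this sketch: `get` only needs to decode within a single bucket, so larger buckets shrink the encoding but make random access proportionally more expensive, which matches the `front-coded-4` vs `front-coded-16` benchmark results above.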
```
Benchmark              (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt      Score      Error  Units
SqlBenchmark.querySql        4           5000000              none        false  avgt    5    152.079 ±    7.234  ms/op
SqlBenchmark.querySql        4           5000000              none        force  avgt    5     76.931 ±    3.635  ms/op
SqlBenchmark.querySql        4           5000000               fc4        false  avgt    5    151.430 ±   11.542  ms/op
SqlBenchmark.querySql        4           5000000               fc4        force  avgt    5     77.045 ±    3.972  ms/op
SqlBenchmark.querySql        4           5000000              fc16        false  avgt    5    160.475 ±   16.692  ms/op
SqlBenchmark.querySql        4           5000000              fc16        force  avgt    5     75.423 ±    4.572  ms/op
SqlBenchmark.querySql        5           5000000              none        false  avgt    5     49.356 ±    2.468  ms/op
SqlBenchmark.querySql        5           5000000              none        force  avgt    5     49.141 ±    2.853  ms/op
SqlBenchmark.querySql        5           5000000               fc4        false  avgt    5     49.169 ±    2.025  ms/op
SqlBenchmark.querySql        5           5000000               fc4        force  avgt    5     49.232 ±    2.801  ms/op
SqlBenchmark.querySql        5           5000000              fc16        false  avgt    5     49.438 ±    2.585  ms/op
SqlBenchmark.querySql        5           5000000              fc16        force  avgt    5     49.451 ±    3.924  ms/op
SqlBenchmark.querySql        6           5000000              none        false  avgt    5    199.922 ±    7.085  ms/op
SqlBenchmark.querySql        6           5000000              none        force  avgt    5    105.862 ±    7.491  ms/op
SqlBenchmark.querySql        6           5000000               fc4        false  avgt    5    209.565 ±   21.137  ms/op
SqlBenchmark.querySql        6           5000000               fc4        force  avgt    5    106.155 ±    7.080  ms/op
SqlBenchmark.querySql        6           5000000              fc16        false  avgt    5    199.692 ±   11.608  ms/op
SqlBenchmark.querySql        6           5000000              fc16        force  avgt    5    107.823 ±    9.708  ms/op
SqlBenchmark.querySql        7           5000000              none        false  avgt    5    154.474 ±    7.260  ms/op
SqlBenchmark.querySql        7           5000000              none        force  avgt    5     84.477 ±    4.757  ms/op
SqlBenchmark.querySql        7           5000000               fc4        false  avgt    5    150.924 ±    8.694  ms/op
SqlBenchmark.querySql        7           5000000               fc4        force  avgt    5     82.500 ±    4.569  ms/op
SqlBenchmark.querySql        7           5000000              fc16        false  avgt    5    151.421 ±    7.497  ms/op
SqlBenchmark.querySql        7           5000000              fc16        force  avgt    5     86.868 ±    3.030  ms/op
SqlBenchmark.querySql       10           5000000              none        false  avgt    5    780.820 ±   68.020  ms/op
SqlBenchmark.querySql       10           5000000              none        force  avgt    5    432.962 ±   62.454  ms/op
SqlBenchmark.querySql       10           5000000               fc4        false  avgt    5    798.933 ±   71.454  ms/op
SqlBenchmark.querySql       10           5000000               fc4        force  avgt    5    458.484 ±   37.257  ms/op
SqlBenchmark.querySql       10           5000000              fc16        false  avgt    5    872.225 ±   49.937  ms/op
SqlBenchmark.querySql       10           5000000              fc16        force  avgt    5    522.340 ±   35.679  ms/op
SqlBenchmark.querySql       12           5000000              none        false  avgt    5     69.159 ±    4.653  ms/op
SqlBenchmark.querySql       12           5000000              none        force  avgt    5     33.735 ±    2.004  ms/op
SqlBenchmark.querySql       12           5000000               fc4        false  avgt    5     69.006 ±    4.162  ms/op
SqlBenchmark.querySql       12           5000000               fc4        force  avgt    5     33.413 ±    1.792  ms/op
SqlBenchmark.querySql       12           5000000              fc16        false  avgt    5     68.738 ±    3.524  ms/op
SqlBenchmark.querySql       12           5000000              fc16        force  avgt    5     34.627 ±    2.654  ms/op
SqlBenchmark.querySql       19           5000000              none        false  avgt    5    700.311 ±  100.489  ms/op
SqlBenchmark.querySql       19           5000000              none        force  avgt    5    601.641 ±   39.403  ms/op
SqlBenchmark.querySql       19           5000000               fc4        false  avgt    5   1056.911 ±   91.692  ms/op
SqlBenchmark.querySql       19           5000000               fc4        force  avgt    5    963.832 ±   55.223  ms/op
SqlBenchmark.querySql       19           5000000              fc16        false  avgt    5   2159.028 ±  145.168  ms/op
SqlBenchmark.querySql       19           5000000              fc16        force  avgt    5   2107.042 ±  123.906  ms/op
SqlBenchmark.querySql       21           5000000              none        false  avgt    5    773.091 ±   46.505  ms/op
SqlBenchmark.querySql       21           5000000              none        force  avgt    5    459.661 ±   36.800  ms/op
SqlBenchmark.querySql       21           5000000               fc4        false  avgt    5    837.426 ±   41.612  ms/op
SqlBenchmark.querySql       21           5000000               fc4        force  avgt    5    504.546 ±   18.115  ms/op
SqlBenchmark.querySql       21           5000000              fc16        false  avgt    5    919.155 ±  106.868  ms/op
SqlBenchmark.querySql       21           5000000              fc16        force  avgt    5    602.116 ±    7.865  ms/op
SqlBenchmark.querySql       22           5000000              none        false  avgt    5     44.268 ±    2.707  ms/op
SqlBenchmark.querySql       22           5000000              none        force  avgt    5     43.919 ±    2.962  ms/op
SqlBenchmark.querySql       22           5000000               fc4        false  avgt    5     44.458 ±    3.590  ms/op
SqlBenchmark.querySql       22           5000000               fc4        force  avgt    5     43.738 ±    2.317  ms/op
SqlBenchmark.querySql       22           5000000              fc16        false  avgt    5     44.304 ±    2.876  ms/op
SqlBenchmark.querySql       22           5000000              fc16        force  avgt    5     43.906 ±    2.283  ms/op
SqlBenchmark.querySql       23           5000000              none        false  avgt    5  18260.991 ± 1041.530  ms/op
SqlBenchmark.querySql       23           5000000              none        force  avgt    5  18400.261 ± 1689.495  ms/op
SqlBenchmark.querySql       23           5000000               fc4        false  avgt    5  30704.667 ± 1624.680  ms/op
SqlBenchmark.querySql       23           5000000               fc4        force  avgt    5  30603.476 ±  621.976  ms/op
SqlBenchmark.querySql       23           5000000              fc16        false  avgt    5  48777.535 ± 3223.712  ms/op
SqlBenchmark.querySql       23           5000000              fc16        force  avgt    5  49966.818 ± 3739.684  ms/op
```

Much of the performance is reasonably close, with the exception of query 23, `SELECT * from foo`, where the slowdown is fairly dramatic, though somewhat expected, and certainly not great. It might be harder to consider turning on this encoding if typical query patterns include a lot of `SELECT *` or groupings on a large number of columns, as the extra cost of decoding dictionaries for value lookup has the potential to begin dominating query time, at least compared to the current strategy. I need to wire up some benchmarks similar to `SqlBenchmark` but backed by actual segments, so I can see how this performs with more realistic data at the segment query level. I've also spent very little time attempting to optimize `FrontCodedIndexed`, so I'm sure there is room for improvement.

#### Design

The encoding itself is done within a new `Indexed<String>` implementation, `FrontCodedIndexed`, which contains all of the methods for reading and writing the buckets of values. I adapted 'variable byte' encoding for integer values from [JavaFastPFOR](https://github.com/lemire/JavaFastPFOR/blob/master/src/main/java/me/lemire/integercompression/VariableByte.java) to write both string value lengths and prefix lengths.

string layout:

| length | value |
| --- | --- |
| vbyte int | byte[] |

bucket layout:

| first string | prefix length | string fragment | ... | prefix length | string fragment |
| --- | --- | --- | --- | --- | --- |
| string | vbyte int | string | ... | vbyte int | string |

front coded indexed layout:

| version | bucket size | has null? | number of values | size of "offsets" + "buckets" | "offsets" | "buckets" |
| --- | --- | --- | --- | --- | --- | --- |
| byte | byte | byte | vbyte int | vbyte int | int[] | bucket[] |

Note that the "offsets" store the starting location of all buckets beyond the first bucket (whose offset is known to be the end of the "offsets" section). The "offsets" are stored as plain integer values instead of vbyte encoded to allow fast access to bucket positions, but they are probably a good candidate for delta-encoded byte packing to further decrease their size.

#### Using it

This functionality can be utilized through a new `IndexSpec` property, `stringDictionaryEncoding`, which can be set to `"fc4"` or `"fc16"` to instruct indexing tasks to write segments with the compressed dictionaries using bucket size 4 or 16 respectively (`"none"` is the default). This mode is not set as the default yet because any segments written like this will be unreadable by older versions of Druid, so care must be taken before migrating to this encoding. Additionally, this needs a lot more testing and measurement to ensure that it is genuinely better in most cases before making it the default.

#### Testing

Besides the direct tests on `FrontCodedIndexed` and `VByte`, I also wired a front-coded segment into both `BaseFilterTest` and `QueryTestHelper`, which provides a rather wide set of test coverage for a variety of scenarios. This process found a number of bugs in my initial commits, so I feel reasonably confident that things are correct at this point.
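As a rough illustration of the 'variable byte' integer encoding used for the value and prefix lengths, here is a Python sketch. It assumes the JavaFastPFOR-style convention of 7 data bits per byte with the high bit set on the terminating byte; the function names and exact details are mine for illustration, not the `VByte` code in this PR.

```python
def vbyte_encode(value: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, least-significant first;
    the high bit marks the final byte (assumed JavaFastPFOR-style convention)."""
    assert value >= 0
    out = bytearray()
    while value >= 0x80:
        out.append(value & 0x7F)   # continuation byte: high bit clear
        value >>= 7
    out.append(value | 0x80)       # final byte: high bit set
    return bytes(out)

def vbyte_decode(buf: bytes, pos: int = 0):
    """Decode one int starting at `pos`; returns (value, next position)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b & 0x80:               # high bit set -> last byte
            return value, pos
        shift += 7

# values below 128 need a single byte, which is why vbyte suits the
# typically short prefix lengths and string lengths in a bucket
assert len(vbyte_encode(7)) == 1
assert len(vbyte_encode(300)) == 2
assert vbyte_decode(vbyte_encode(300)) == (300, 2)
```

This also shows why the bucket "offsets" are kept as plain integers rather than vbyte: decoding is inherently sequential, so vbyte-encoded offsets could not be binary-searched or randomly addressed without first scanning them.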
#### Future work

Before I started coding on this, in addition to the paper linked in #3922 (https://arxiv.org/pdf/1101.5506.pdf), I also read through https://arxiv.org/pdf/1911.08372.pdf, a newer paper by one of the authors of the first link, and stumbled upon https://link.springer.com/content/pdf/10.1007/s00778-020-00620-x.pdf, all of which detail additional (much fancier) improvements that can be made to this strategy by further coding the string values (https://en.wikipedia.org/wiki/Re-Pair seems to be the primary focus in these papers). It would probably be worth investigating what additional size improvements can be gained, and at what cost.

Additionally, it seems ideal to be able to vary which encoding is used per column instead of setting it at the segment level (this seems true of other types of compression as well...). This could allow collection of statistics at indexing time to determine how likely this encoding is to be useful, such as minimum value cardinality thresholds or similar (akin to the 'auto' encoding available for long columns).

<hr>

##### Key changed/added classes in this PR
 * `VByte`
 * `FrontCodedIndexed`
 * `FrontCodedIndexedWriter`
 * `StringDimensionMergerV9`
 * `DictionaryEncodedColumnPartSerde`
 * `StringFrontCodedDictionaryEncodedColumn`
 * `StringFrontCodedDictionaryEncodedColumnSupplier`
 * `StringFrontCodedBitmapIndexColumnPartSupplier`

<hr>

<!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->

This PR has:
- [ ] been self-reviewed.
   - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
