clintropolis opened a new pull request #12277: URL: https://github.com/apache/druid/pull/12277
Closes #3922

### Description

This PR adds a new way of storing `STRING` typed columns, using an incremental encoding strategy called 'front coding'. Essentially, sorted values are split into buckets; the first value in each bucket is written completely, and the remaining values in the bucket are stored as a pair composed of the length of the prefix shared with the first string (the prefix length) and the fragment of the value that remains after that prefix. Using the quickstart wikipedia example, this results in greatly reduced segment sizes, with a relatively small performance penalty in most typical Druid queries that grows with the number of columns involved in a query. I suspect in many cases this penalty is worth the cost, since being able to fit significantly more segments in memory avoids disk reads, depending on the typical query workloads of the cluster.

#### Segment Size and Performance

Wikipedia "quickstart" segments:

<img width="968" alt="Screen Shot 2022-02-23 at 3 19 33 AM" src="https://user-images.githubusercontent.com/1577461/155309650-ba583aed-417d-4a8c-bcda-0b00aa08b076.png">

`GenericIndexed` vs `FrontCodedIndexed` with bucket size 4 and size 16:

```
Benchmark                                        (indexType)  (numElements)  (numOperations)  (width)  Mode  Cnt        Score      Error  Units
FrontCodedIndexedBenchmark.get                       generic          10000            10000       16  avgt    5       78.819 ±    2.541  ns/op
FrontCodedIndexedBenchmark.get:encoded size          generic          10000            10000       16  avgt    5   240010.000             bytes
FrontCodedIndexedBenchmark.get                 front-coded-4          10000            10000       16  avgt    5      220.887 ±   19.267  ns/op
FrontCodedIndexedBenchmark.get:encoded size    front-coded-4          10000            10000       16  avgt    5   169894.000             bytes
FrontCodedIndexedBenchmark.get                front-coded-16          10000            10000       16  avgt    5      629.621 ±   59.667  ns/op
FrontCodedIndexedBenchmark.get:encoded size   front-coded-16          10000            10000       16  avgt    5   164428.000             bytes
FrontCodedIndexedBenchmark.indexOf                   generic          10000            10000       16  avgt    5     1228.012 ±  100.691  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size      generic          10000            10000       16  avgt    5   240010.000             bytes
FrontCodedIndexedBenchmark.indexOf             front-coded-4          10000            10000       16  avgt    5     2101.530 ±  209.465  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-4         10000            10000       16  avgt    5   169908.000             bytes
FrontCodedIndexedBenchmark.indexOf            front-coded-16          10000            10000       16  avgt    5     2364.248 ±  156.352  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-16        10000            10000       16  avgt    5   164321.000             bytes
FrontCodedIndexedBenchmark.iterator                  generic          10000            10000       16  avgt    5       80.247 ±    1.639  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size     generic          10000            10000       16  avgt    5   240010.000             bytes
FrontCodedIndexedBenchmark.iterator            front-coded-4          10000            10000       16  avgt    5      101.921 ±    6.225  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-4        10000            10000       16  avgt    5   169889.000             bytes
FrontCodedIndexedBenchmark.iterator           front-coded-16          10000            10000       16  avgt    5      105.235 ±    2.333  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-16       10000            10000       16  avgt    5   164221.000             bytes
FrontCodedIndexedBenchmark.get                       generic         100000            10000       16  avgt    5       95.947 ±    3.842  ns/op
FrontCodedIndexedBenchmark.get:encoded size          generic         100000            10000       16  avgt    5  2400010.000             bytes
FrontCodedIndexedBenchmark.get                 front-coded-4         100000            10000       16  avgt    5      239.630 ±   40.148  ns/op
FrontCodedIndexedBenchmark.get:encoded size    front-coded-4         100000            10000       16  avgt    5  1636718.000             bytes
FrontCodedIndexedBenchmark.get                front-coded-16         100000            10000       16  avgt    5      648.001 ±   28.965  ns/op
FrontCodedIndexedBenchmark.get:encoded size   front-coded-16         100000            10000       16  avgt    5  1564861.000             bytes
FrontCodedIndexedBenchmark.indexOf                   generic         100000            10000       16  avgt    5     1612.575 ±  149.556  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size      generic         100000            10000       16  avgt    5  2400010.000             bytes
FrontCodedIndexedBenchmark.indexOf             front-coded-4         100000            10000       16  avgt    5     2934.297 ±  176.182  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-4        100000            10000       16  avgt    5  1636652.000             bytes
FrontCodedIndexedBenchmark.indexOf            front-coded-16         100000            10000       16  avgt    5     3235.271 ±  438.586  ns/op
FrontCodedIndexedBenchmark.indexOf:encoded size front-coded-16       100000            10000       16  avgt    5  1563940.000             bytes
FrontCodedIndexedBenchmark.iterator                  generic         100000            10000       16  avgt    5      787.253 ±   83.715  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size     generic         100000            10000       16  avgt    5  2400010.000             bytes
FrontCodedIndexedBenchmark.iterator            front-coded-4         100000            10000       16  avgt    5     1048.532 ±   29.200  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-4       100000            10000       16  avgt    5  1636546.000             bytes
FrontCodedIndexedBenchmark.iterator           front-coded-16         100000            10000       16  avgt    5     1054.068 ±   66.667  ns/op
FrontCodedIndexedBenchmark.iterator:encoded size front-coded-16      100000            10000       16  avgt    5  1565019.000             bytes
```

How this translates into queries has so far been measured with `SqlBenchmark`, though its generated data set does pretty poorly with this encoding, since it is composed of numbers translated directly into strings, leaving few prefixes to take advantage of (383MB with generic indexed compared to 381MB with front-coded indexed). This means that the size advantage is mostly absent here, so this is likely near the worst case for this encoding strategy.
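To make the bucketing scheme concrete, here is an illustrative Python sketch of bucketed front coding: sorted values are grouped into fixed-size buckets, the first value of each bucket is stored whole, and the rest are stored as (prefix length, fragment) pairs against that first value. All names and details here are mine for illustration only, not the Java implementation in this PR.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared prefix of two byte strings."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def front_code(sorted_values, bucket_size=4):
    """Encode a sorted list of byte strings into front-coded buckets."""
    buckets = []
    for start in range(0, len(sorted_values), bucket_size):
        chunk = sorted_values[start:start + bucket_size]
        first = chunk[0]
        rest = []
        for value in chunk[1:]:
            prefix = common_prefix_len(first, value)
            rest.append((prefix, value[prefix:]))  # (prefix length, fragment)
        buckets.append((first, rest))
    return buckets

def get(buckets, bucket_size, index):
    """Reconstruct the value at `index`, touching only one bucket."""
    first, rest = buckets[index // bucket_size]
    offset = index % bucket_size
    if offset == 0:
        return first
    prefix, fragment = rest[offset - 1]
    return first[:prefix] + fragment

values = sorted([b"druid", b"drum", b"drummer", b"drunk", b"dry", b"duck"])
buckets = front_code(values, bucket_size=4)
# every value round-trips exactly
assert all(get(buckets, 4, i) == v for i, v in enumerate(values))
```

Note the trade-off visible even in this sketch: `get` only needs to decode within a single bucket, so larger buckets shrink the encoding but make random access proportionally more expensive, which matches the `front-coded-4` vs `front-coded-16` benchmark results above.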
```
Benchmark              (query)  (rowsPerSegment)  (stringEncoding)  (vectorize)  Mode  Cnt      Score      Error  Units
SqlBenchmark.querySql        4           5000000              none        false  avgt    5    152.079 ±    7.234  ms/op
SqlBenchmark.querySql        4           5000000              none        force  avgt    5     76.931 ±    3.635  ms/op
SqlBenchmark.querySql        4           5000000               fc4        false  avgt    5    151.430 ±   11.542  ms/op
SqlBenchmark.querySql        4           5000000               fc4        force  avgt    5     77.045 ±    3.972  ms/op
SqlBenchmark.querySql        4           5000000              fc16        false  avgt    5    160.475 ±   16.692  ms/op
SqlBenchmark.querySql        4           5000000              fc16        force  avgt    5     75.423 ±    4.572  ms/op
SqlBenchmark.querySql        5           5000000              none        false  avgt    5     49.356 ±    2.468  ms/op
SqlBenchmark.querySql        5           5000000              none        force  avgt    5     49.141 ±    2.853  ms/op
SqlBenchmark.querySql        5           5000000               fc4        false  avgt    5     49.169 ±    2.025  ms/op
SqlBenchmark.querySql        5           5000000               fc4        force  avgt    5     49.232 ±    2.801  ms/op
SqlBenchmark.querySql        5           5000000              fc16        false  avgt    5     49.438 ±    2.585  ms/op
SqlBenchmark.querySql        5           5000000              fc16        force  avgt    5     49.451 ±    3.924  ms/op
SqlBenchmark.querySql        6           5000000              none        false  avgt    5    199.922 ±    7.085  ms/op
SqlBenchmark.querySql        6           5000000              none        force  avgt    5    105.862 ±    7.491  ms/op
SqlBenchmark.querySql        6           5000000               fc4        false  avgt    5    209.565 ±   21.137  ms/op
SqlBenchmark.querySql        6           5000000               fc4        force  avgt    5    106.155 ±    7.080  ms/op
SqlBenchmark.querySql        6           5000000              fc16        false  avgt    5    199.692 ±   11.608  ms/op
SqlBenchmark.querySql        6           5000000              fc16        force  avgt    5    107.823 ±    9.708  ms/op
SqlBenchmark.querySql        7           5000000              none        false  avgt    5    154.474 ±    7.260  ms/op
SqlBenchmark.querySql        7           5000000              none        force  avgt    5     84.477 ±    4.757  ms/op
SqlBenchmark.querySql        7           5000000               fc4        false  avgt    5    150.924 ±    8.694  ms/op
SqlBenchmark.querySql        7           5000000               fc4        force  avgt    5     82.500 ±    4.569  ms/op
SqlBenchmark.querySql        7           5000000              fc16        false  avgt    5    151.421 ±    7.497  ms/op
SqlBenchmark.querySql        7           5000000              fc16        force  avgt    5     86.868 ±    3.030  ms/op
SqlBenchmark.querySql       10           5000000              none        false  avgt    5    780.820 ±   68.020  ms/op
SqlBenchmark.querySql       10           5000000              none        force  avgt    5    432.962 ±   62.454  ms/op
SqlBenchmark.querySql       10           5000000               fc4        false  avgt    5    798.933 ±   71.454  ms/op
SqlBenchmark.querySql       10           5000000               fc4        force  avgt    5    458.484 ±   37.257  ms/op
SqlBenchmark.querySql       10           5000000              fc16        false  avgt    5    872.225 ±   49.937  ms/op
SqlBenchmark.querySql       10           5000000              fc16        force  avgt    5    522.340 ±   35.679  ms/op
SqlBenchmark.querySql       12           5000000              none        false  avgt    5     69.159 ±    4.653  ms/op
SqlBenchmark.querySql       12           5000000              none        force  avgt    5     33.735 ±    2.004  ms/op
SqlBenchmark.querySql       12           5000000               fc4        false  avgt    5     69.006 ±    4.162  ms/op
SqlBenchmark.querySql       12           5000000               fc4        force  avgt    5     33.413 ±    1.792  ms/op
SqlBenchmark.querySql       12           5000000              fc16        false  avgt    5     68.738 ±    3.524  ms/op
SqlBenchmark.querySql       12           5000000              fc16        force  avgt    5     34.627 ±    2.654  ms/op
SqlBenchmark.querySql       19           5000000              none        false  avgt    5    700.311 ±  100.489  ms/op
SqlBenchmark.querySql       19           5000000              none        force  avgt    5    601.641 ±   39.403  ms/op
SqlBenchmark.querySql       19           5000000               fc4        false  avgt    5   1056.911 ±   91.692  ms/op
SqlBenchmark.querySql       19           5000000               fc4        force  avgt    5    963.832 ±   55.223  ms/op
SqlBenchmark.querySql       19           5000000              fc16        false  avgt    5   2159.028 ±  145.168  ms/op
SqlBenchmark.querySql       19           5000000              fc16        force  avgt    5   2107.042 ±  123.906  ms/op
SqlBenchmark.querySql       21           5000000              none        false  avgt    5    773.091 ±   46.505  ms/op
SqlBenchmark.querySql       21           5000000              none        force  avgt    5    459.661 ±   36.800  ms/op
SqlBenchmark.querySql       21           5000000               fc4        false  avgt    5    837.426 ±   41.612  ms/op
SqlBenchmark.querySql       21           5000000               fc4        force  avgt    5    504.546 ±   18.115  ms/op
SqlBenchmark.querySql       21           5000000              fc16        false  avgt    5    919.155 ±  106.868  ms/op
SqlBenchmark.querySql       21           5000000              fc16        force  avgt    5    602.116 ±    7.865  ms/op
SqlBenchmark.querySql       22           5000000              none        false  avgt    5     44.268 ±    2.707  ms/op
SqlBenchmark.querySql       22           5000000              none        force  avgt    5     43.919 ±    2.962  ms/op
SqlBenchmark.querySql       22           5000000               fc4        false  avgt    5     44.458 ±    3.590  ms/op
SqlBenchmark.querySql       22           5000000               fc4        force  avgt    5     43.738 ±    2.317  ms/op
SqlBenchmark.querySql       22           5000000              fc16        false  avgt    5     44.304 ±    2.876  ms/op
SqlBenchmark.querySql       22           5000000              fc16        force  avgt    5     43.906 ±    2.283  ms/op
SqlBenchmark.querySql       23           5000000              none        false  avgt    5  18260.991 ± 1041.530  ms/op
SqlBenchmark.querySql       23           5000000              none        force  avgt    5  18400.261 ± 1689.495  ms/op
SqlBenchmark.querySql       23           5000000               fc4        false  avgt    5  30704.667 ± 1624.680  ms/op
SqlBenchmark.querySql       23           5000000               fc4        force  avgt    5  30603.476 ±  621.976  ms/op
SqlBenchmark.querySql       23           5000000              fc16        false  avgt    5  48777.535 ± 3223.712  ms/op
SqlBenchmark.querySql       23           5000000              fc16        force  avgt    5  49966.818 ± 3739.684  ms/op
```

Much of the performance is reasonably close, with the exception of query 23, `SELECT * from foo`, where the slowdown is fairly dramatic, though somewhat expected, and certainly not great. It might be harder to consider turning on this encoding if typical query patterns include a lot of `SELECT *` or groupings on a large number of columns, as the extra cost of decoding dictionaries for value lookup has the potential to begin dominating query time, at least compared to the current strategy. I need to wire up some benchmarks similar to `SqlBenchmark` but backed by actual segments, so I can see how this performs with more realistic data at the segment query level. I've also spent very little time attempting to optimize `FrontCodedIndexed`, so I'm sure there is room for improvement.

#### Design

The encoding itself is done within a new `Indexed<String>` implementation, `FrontCodedIndexed`, which contains all of the methods for reading and writing the buckets of values. I adapted 'variable byte' encoding for integer values from [JavaFastPFOR](https://github.com/lemire/JavaFastPFOR/blob/master/src/main/java/me/lemire/integercompression/VariableByte.java) to write both string value lengths and prefix lengths.

string layout:

| length | value |
| --- | --- |
| vbyte int | byte[] |

bucket layout:

| first string | prefix length | string fragment | ... | prefix length | string fragment |
| --- | --- | --- | --- | --- | --- |
| string | vbyte int | string | ... | vbyte int | string |

front coded indexed layout:

| version | bucket size | has null? | number of values | size of "offsets" + "buckets" | "offsets" | "buckets" |
| --- | --- | --- | --- | --- | --- | --- |
| byte | byte | byte | vbyte int | vbyte int | int[] | bucket[] |

Note that the "offsets" store the starting location of all buckets beyond the first bucket (whose offset is known to be the end of the "offsets" section). The "offsets" are stored as plain integer values instead of vbyte encoded to allow fast access to bucket positions, but they are probably a good candidate for delta-encoded byte packing to further decrease their size.

#### Using it

This functionality can be utilized through a new `IndexSpec` property, `stringDictionaryEncoding`, which can be set to `"fc4"` or `"fc16"` to instruct indexing tasks to write segments with the compressed dictionaries using bucket size 4 or 16 respectively (`"none"` is the default). This mode is not set as the default yet because any segments written like this will be unreadable by older versions of Druid, so care must be taken before migrating to this encoding. Additionally, this needs a lot more testing and measurement to ensure that it is genuinely better in most cases before making it the default.

#### Testing

Besides the direct tests on `FrontCodedIndexed` and `VByte`, I also wired a front-coded segment into both `BaseFilterTest` and `QueryTestHelper`, which provides a rather wide set of test coverage for a variety of scenarios. This process found a number of bugs in my initial commits, so I feel reasonably confident that things are correct at this point.
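As a rough illustration of the 'variable byte' integer encoding used for the value and prefix lengths, here is a Python sketch. It assumes the JavaFastPFOR-style convention of 7 data bits per byte with the high bit set on the terminating byte; the function names and exact details are mine for illustration, not the `VByte` code in this PR.

```python
def vbyte_encode(value: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, least-significant first;
    the high bit marks the final byte (assumed JavaFastPFOR-style convention)."""
    assert value >= 0
    out = bytearray()
    while value >= 0x80:
        out.append(value & 0x7F)   # continuation byte: high bit clear
        value >>= 7
    out.append(value | 0x80)       # final byte: high bit set
    return bytes(out)

def vbyte_decode(buf: bytes, pos: int = 0):
    """Decode one int starting at `pos`; returns (value, next position)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b & 0x80:               # high bit set -> last byte
            return value, pos
        shift += 7

# values below 128 need a single byte, which is why vbyte suits the
# typically short prefix lengths and string lengths in a bucket
assert len(vbyte_encode(7)) == 1
assert len(vbyte_encode(300)) == 2
assert vbyte_decode(vbyte_encode(300)) == (300, 2)
```

This also shows why the bucket "offsets" are kept as plain integers rather than vbyte: decoding is inherently sequential, so vbyte-encoded offsets could not be binary-searched or randomly addressed without first scanning them.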
#### Future work

Before I started coding on this, in addition to the paper linked in #3922 (https://arxiv.org/pdf/1101.5506.pdf), I also read through https://arxiv.org/pdf/1911.08372.pdf, a newer paper by one of the authors of the first link, and stumbled upon https://link.springer.com/content/pdf/10.1007/s00778-020-00620-x.pdf, all of which detail additional (much fancier) improvements that can be made to this strategy by further coding the string values (https://en.wikipedia.org/wiki/Re-Pair seems to be the primary focus in these papers). It would probably be worth investigating what additional size improvements can be gained, and at what cost.

Additionally, it seems ideal to be able to vary which encoding is used per column instead of setting it at the segment level (this seems true of other types of compression as well...). This could allow collection of statistics at indexing time to determine how likely this encoding is to be useful, such as minimum value cardinality thresholds or similar (akin to the 'auto' encoding available for long columns).

<hr>

##### Key changed/added classes in this PR
 * `VByte`
 * `FrontCodedIndexed`
 * `FrontCodedIndexedWriter`
 * `StringDimensionMergerV9`
 * `DictionaryEncodedColumnPartSerde`
 * `StringFrontCodedDictionaryEncodedColumn`
 * `StringFrontCodedDictionaryEncodedColumnSupplier`
 * `StringFrontCodedBitmapIndexColumnPartSupplier`

<hr>

<!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->

This PR has:
- [ ] been self-reviewed.
   - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
