Re: [PR] Add inverted-index-based distinct operator with runtime cost heuristic [pinot]

via GitHub Tue, 24 Mar 2026 00:27:31 -0700


xiangfu0 commented on PR #17872:
URL: https://github.com/apache/pinot/pull/17872#issuecomment-4115993260


   ## Updated JMH Benchmark Results
   
   **Setup:** 1M docs, 4 cardinalities × 5 selectivities × 3 paths, 
`@Fork(value=2, warmups=0)`, `@Warmup(iterations=2)`, 
`@Measurement(iterations=3)`, JDK 17, ~34 min total
   
   ### Sorted index path (μs/op, lower is better)
   
   | dictCard | 0.1% (1K docs) | 1% (10K) | 10% (100K) | 50% (500K) | 100% (1M) 
|
   
|----------|----------------|----------|------------|------------|-----------|
   | 100 | 1.3 | 1.6 | 1.1 | 1.3 | 1.2 |
   | 1K | 7.7 | 13.8 | 10.5 | 9.6 | 9.1 |
   | 10K | 45.7 | 81.3 | 101 | 86.0 | 95.7 |
   | 100K | 243 | 301 | 888 | 822 | 646 |
   
   ### Bitmap inverted index path (μs/op)
   
   | dictCard | 0.1% (1K docs) | 1% (10K) | 10% (100K) | 50% (500K) | 100% (1M) 
|
   
|----------|----------------|----------|------------|------------|-----------|
   | 100 | 38 | 19 | 1.6 | 1.0 | 1.0 |
   | 1K | 4,256 | 767 | 27 | 16 | 13 |
   | 10K | 12,979 | 14,053 | 369 | 179 | 137 |
   | 100K | 35,459 | 61,037 | 10,635 | 5,780 | 3,522 |
   
   ### Scan path (μs/op)
   
   | dictCard | 0.1% (1K docs) | 1% (10K) | 10% (100K) | 50% (500K) | 100% (1M) 
|
   
|----------|----------------|----------|------------|------------|-----------|
   | 100 | 9 | 120 | 2,357 | 11,064 | 20,114 |
   | 1K | 17 | 246 | 3,157 | 15,996 | 30,353 |
   | 10K | 19 | 342 | 697 | 2,388 | 4,323 |
   | 100K | 19 | 609 | 1,518 | 4,220 | 7,680 |
   
   ### Cost heuristic crossover validation
   
   | dictCard | configured ratio | heuristic threshold | actual crossover |
   |----------|-----------------|--------------------|-----------------| 
   | 100 | 30x | 3,000 filtered docs | inverted wins at 1%+ (10K docs) ✓ |
   | 1K | 30x | 30,000 filtered docs | inverted wins at 10%+ (100K docs) ✓ |
   | 10K | 10x | 100,000 filtered docs | inverted wins at 10%+ (100K docs) ✓ |
   | 100K | 6x | 600,000 filtered docs | inverted wins at 100% (1M docs) ✓ |
   
   ### Key takeaways
   
   - **Sorted index** is the fastest path across all configurations (1–888 μs), 
always chosen when a sorted column is available
   - **Bitmap inverted index** dominates scan when `filteredDocs >> dictCard` 
(e.g., dictCard=100 at 10%: **1.6 μs vs 2,357 μs = 1,473x faster**)
   - **Scan** dominates inverted index when dictCard is high relative to 
filtered docs (e.g., dictCard=1K at 0.1%: **17 μs vs 4,256 μs**)
   - Per-cardinality cost ratios (30x / 10x / 6x) correctly steer the heuristic 
at all crossover points
   
   <details>
   <summary>Raw JMH output</summary>
   
   ```
   Benchmark                                         (_dictionaryCardinality)  
(_filterSelectivity)  (_numDocs)  Mode  Cnt      Score       Error  Units
   BenchmarkInvertedIndexDistinct.invertedIndexPath                       100   
              0.001     1000000  avgt    6     38.231 ±     1.357  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                       100   
               0.01     1000000  avgt    6     18.694 ±     4.612  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                       100   
                0.1     1000000  avgt    6      1.584 ±     0.168  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                       100   
                0.5     1000000  avgt    6      0.983 ±     0.104  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                       100   
                1.0     1000000  avgt    6      0.972 ±     0.017  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                      1000   
              0.001     1000000  avgt    6   4256.134 ±   157.568  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                      1000   
               0.01     1000000  avgt    6    766.973 ±    25.905  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                      1000   
                0.1     1000000  avgt    6     26.504 ±     1.263  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                      1000   
                0.5     1000000  avgt    6     15.891 ±     0.426  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                      1000   
                1.0     1000000  avgt    6     12.598 ±     0.295  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                     10000   
              0.001     1000000  avgt    6  12979.261 ±  6223.626  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                     10000   
               0.01     1000000  avgt    6  14053.085 ±   417.580  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                     10000   
                0.1     1000000  avgt    6    369.112 ±    22.802  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                     10000   
                0.5     1000000  avgt    6    179.168 ±    14.620  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                     10000   
                1.0     1000000  avgt    6    137.372 ±     6.651  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                    100000   
              0.001     1000000  avgt    6  35458.867 ± 25631.306  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                    100000   
               0.01     1000000  avgt    6  61036.517 ±  4329.001  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                    100000   
                0.1     1000000  avgt    6  10635.409 ±   501.969  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                    100000   
                0.5     1000000  avgt    6   5779.762 ±   394.506  us/op
   BenchmarkInvertedIndexDistinct.invertedIndexPath                    100000   
                1.0     1000000  avgt    6   3521.908 ±   160.910  us/op
   BenchmarkInvertedIndexDistinct.scanPath                                100   
              0.001     1000000  avgt    6      9.273 ±     0.510  us/op
   BenchmarkInvertedIndexDistinct.scanPath                                100   
               0.01     1000000  avgt    6    119.694 ±    19.105  us/op
   BenchmarkInvertedIndexDistinct.scanPath                                100   
                0.1     1000000  avgt    6   2356.700 ±   129.508  us/op
   BenchmarkInvertedIndexDistinct.scanPath                                100   
                0.5     1000000  avgt    6  11064.090 ±   314.943  us/op
   BenchmarkInvertedIndexDistinct.scanPath                                100   
                1.0     1000000  avgt    6  20114.177 ±   520.504  us/op
   BenchmarkInvertedIndexDistinct.scanPath                               1000   
              0.001     1000000  avgt    6     17.063 ±     0.130  us/op
   BenchmarkInvertedIndexDistinct.scanPath                               1000   
               0.01     1000000  avgt    6    245.697 ±    21.773  us/op
   BenchmarkInvertedIndexDistinct.scanPath                               1000   
                0.1     1000000  avgt    6   3156.635 ±    65.038  us/op
   BenchmarkInvertedIndexDistinct.scanPath                               1000   
                0.5     1000000  avgt    6  15996.288 ±  1633.042  us/op
   BenchmarkInvertedIndexDistinct.scanPath                               1000   
                1.0     1000000  avgt    6  30352.554 ±   535.360  us/op
   BenchmarkInvertedIndexDistinct.scanPath                              10000   
              0.001     1000000  avgt    6     19.081 ±     0.530  us/op
   BenchmarkInvertedIndexDistinct.scanPath                              10000   
               0.01     1000000  avgt    6    342.450 ±    20.136  us/op
   BenchmarkInvertedIndexDistinct.scanPath                              10000   
                0.1     1000000  avgt    6    697.304 ±    12.680  us/op
   BenchmarkInvertedIndexDistinct.scanPath                              10000   
                0.5     1000000  avgt    6   2388.036 ±    35.300  us/op
   BenchmarkInvertedIndexDistinct.scanPath                              10000   
                1.0     1000000  avgt    6   4323.113 ±   342.050  us/op
   BenchmarkInvertedIndexDistinct.scanPath                             100000   
              0.001     1000000  avgt    6     19.445 ±     2.081  us/op
   BenchmarkInvertedIndexDistinct.scanPath                             100000   
               0.01     1000000  avgt    6    608.963 ±    51.412  us/op
   BenchmarkInvertedIndexDistinct.scanPath                             100000   
                0.1     1000000  avgt    6   1517.888 ±   137.011  us/op
   BenchmarkInvertedIndexDistinct.scanPath                             100000   
                0.5     1000000  avgt    6   4219.561 ±   248.307  us/op
   BenchmarkInvertedIndexDistinct.scanPath                             100000   
                1.0     1000000  avgt    6   7680.396 ±   136.475  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                         100   
              0.001     1000000  avgt    6      1.264 ±     0.144  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                         100   
               0.01     1000000  avgt    6      1.629 ±     0.034  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                         100   
                0.1     1000000  avgt    6      1.125 ±     0.085  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                         100   
                0.5     1000000  avgt    6      1.341 ±     0.092  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                         100   
                1.0     1000000  avgt    6      1.161 ±     0.042  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                        1000   
              0.001     1000000  avgt    6      7.746 ±     0.986  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                        1000   
               0.01     1000000  avgt    6     13.783 ±     0.340  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                        1000   
                0.1     1000000  avgt    6     10.462 ±     0.610  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                        1000   
                0.5     1000000  avgt    6      9.629 ±     0.501  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                        1000   
                1.0     1000000  avgt    6      9.134 ±     0.771  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                       10000   
              0.001     1000000  avgt    6     45.663 ±     3.165  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                       10000   
               0.01     1000000  avgt    6     81.316 ±     3.230  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                       10000   
                0.1     1000000  avgt    6    101.034 ±     8.342  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                       10000   
                0.5     1000000  avgt    6     85.989 ±    12.125  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                       10000   
                1.0     1000000  avgt    6     95.657 ±    56.279  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                      100000   
              0.001     1000000  avgt    6    242.897 ±    34.804  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                      100000   
               0.01     1000000  avgt    6    300.715 ±    54.706  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                      100000   
                0.1     1000000  avgt    6    888.095 ±   202.339  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                      100000   
                0.5     1000000  avgt    6    821.560 ±   244.944  us/op
   BenchmarkInvertedIndexDistinct.sortedIndexPath                      100000   
                1.0     1000000  avgt    6    646.489 ±    24.215  us/op
   ```
   
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add inverted-index-based distinct operator with runtime cost heuristic [pinot]

Reply via email to