[GitHub] [druid] jasonk000 opened a new pull request #12105: perf: indexing: Introduce a bulk getValuesInto function to read values

GitBox Wed, 29 Dec 2021 16:33:44 -0800


jasonk000 opened a new pull request #12105:
URL: https://github.com/apache/druid/pull/12105



   ### Description
   
   If a large number of values is required from `DimensionDictionary` during 
indexing, fetch them all in a single lock/unlock call instead of locking and 
unlocking for each individual item.
   
   During indexing there are repeated lock/unlock boundary crossings. In a 
sample application (57 fields to index), these consume ~9% of the taskrunner 
CPU.
   
   Depending on the specifics of the indexing row configuration, the indexer's 
use of `DimensionDictionary` can consume anywhere from 1-20% of the CPU time 
during processing. This PR addresses one aspect of the processing, specifically 
the `getValue()` calls. (I'll introduce another PR on the add/size change).
   
   #### Introduce a `getValuesInto()` call to the `DimensionDictionary`, and use it
   
   This introduces a `getValuesInto` call, which accepts an array of IDs to 
fetch, and performs the equivalent of many `getValue()` calls.
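
As a minimal sketch of the idea (not the code in this PR): `SketchDictionary` and `SketchIndexer` below are hypothetical stand-ins, and the `getValuesInto(int[], String[])` signature is an assumption. The per-item `getValue()` takes and releases the read lock on every call, while the bulk variant takes it once for the whole batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for DimensionDictionary; not the actual Druid class. */
class SketchDictionary
{
  private final List<String> idToValue = new ArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  /** Appends a value under the write lock and returns its id. */
  int add(String value)
  {
    lock.writeLock().lock();
    try {
      idToValue.add(value);
      return idToValue.size() - 1;
    }
    finally {
      lock.writeLock().unlock();
    }
  }

  /** Per-item read: one lock/unlock pair for every id looked up. */
  String getValue(int id)
  {
    lock.readLock().lock();
    try {
      return idToValue.get(id);
    }
    finally {
      lock.readLock().unlock();
    }
  }

  /** Bulk read (assumed signature): one lock/unlock pair for the whole batch. */
  void getValuesInto(int[] ids, String[] out)
  {
    lock.readLock().lock();
    try {
      for (int i = 0; i < ids.length; i++) {
        out[i] = idToValue.get(ids[i]);
      }
    }
    finally {
      lock.readLock().unlock();
    }
  }
}

/** Hypothetical caller showing how an indexer might use the bulk read. */
class SketchIndexer
{
  static long estimateSize(SketchDictionary dict, int[] key)
  {
    String[] values = new String[key.length];
    dict.getValuesInto(key, values); // single lock crossing for all ids
    long estimatedSize = (long) key.length * Integer.BYTES;
    for (String val : values) {
      if (val != null) {
        // Same per-String estimate used elsewhere in this description.
        estimatedSize += 28 + 16 + (2 * val.length());
      }
    }
    return estimatedSize;
  }
}
```

With this shape the number of lock crossings per row no longer grows with the row width, which is consistent with the two-thread benchmark results below staying roughly flat between `rowSize` 8 and 40.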
   
   #### Expand benchmarks to cover wider rows and some concurrency
   
   ### Design option
   
   There is one other design option that is faster again in most of the 
benchmarks below (the exception is the two-thread, rowSize 40 case), but is a 
bit less intuitive.
   
   Here are the benchmark results across all three:
   
   ```
   single getValue()
   Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.046 ±  0.001  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.215 ±  0.007  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  3.484 ±  0.032  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  8.638 ±  0.514  us/op
   
   bulk getValuesInto() <----- THIS SOLUTION
   Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.039 ±  0.001  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.120 ±  0.002  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  0.383 ±  0.052  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  0.386 ±  0.018  us/op
   
   using doInsideReadLock() <----- ALTERNATIVE, FASTER BUT LESS CLEAN
   Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.018 ±  0.001  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.077 ±  0.002  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  0.241 ±  0.004  us/op
   StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  0.486 ±  0.002  us/op
   ```
   
   ### Alternatives (faster!)
   
   The alternative, `doInsideReadLock()`, is to pass a lambda / closure across 
the `DimensionDictionary` boundary, so that the dictionary takes the read lock 
and runs the closure on the other side. This is likely to be faster for most 
cases; however, it's a bit less clear in the API (I'll leave an extra comment).
   
   The solution involves similar changes, plus passing a closure across the 
boundary. However, it does leak implementation concerns: the caller gets direct 
access to the dictionary's internal `idToValue` list and `idForNull`.
   
   ```
   /************ DimensionDictionary.java **************/
   public void doInsideReadLock(BiConsumer<List<T>, Integer> fn)
   {
     lock.readLock().lock();
     try {
       fn.accept(idToValue, idForNull);
     }
     finally {
       lock.readLock().unlock();
     }
   }
   
   
   /************ StringDimensionIndexer.java **************/
   @Override
   public long estimateEncodedKeyComponentSize(int[] key)
   {
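     // Single-element array so the lambda below can accumulate into it
     // (locals captured by a lambda must be effectively final).
     int[] estimatedSize = new int[]{key.length * Integer.BYTES};
     dimLookup.doInsideReadLock((List<String> idToValue, Integer idForNull) -> {
       for (int id : key) {
         if (id != idForNull) {
           String val = idToValue.get(id);
           // Rough heap estimate per String: 28 + 16 bytes of fixed overhead
           // plus 2 bytes per character.
           int sizeOfString = 28 + 16 + (2 * val.length());
           estimatedSize[0] += sizeOfString;
         }
       }
     });
     return estimatedSize[0];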
   }
   ```
   
   
   Otherwise, I think the changes are straightforward!
   
   <hr>
   
   
   This PR has:
   - [x] been self-reviewed.
      - [x] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever it would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster. (as part of other changes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


