jasonk000 opened a new pull request #12105:
URL: https://github.com/apache/druid/pull/12105
### Description
If a large number of values are required from `DimensionDictionary` during
indexing, fetch them all in a single lock/unlock call instead of locking and
unlocking for each individual item.
During indexing, the lock/unlock boundary is crossed repeatedly. In a
sample application (57 fields to index), this consumes ~9% of the taskrunner
CPU.
Depending on the specifics of the indexing row configuration, the indexer's use
of `DimensionDictionary` can consume anywhere from 1-20% of the CPU time during
processing. This PR addresses one aspect of that processing, specifically the
`getValue()` lookups. (I'll introduce another PR for the add/size change.)
#### Introduce a `getValuesInto()` call to the `DimensionDictionary`, and use it
This introduces a `getValuesInto` call, which accepts an array of IDs to
fetch, and performs the equivalent of many `getValue()` calls.
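
To illustrate the idea, here is a minimal sketch of a dictionary with both a per-item lookup and a bulk lookup. `SketchDictionary`, its field names, and the exact `getValuesInto(int[], T[])` signature are assumptions made for illustration; they are not copied from the PR diff and may differ from the real `DimensionDictionary`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for DimensionDictionary; names and signatures are illustrative only.
class SketchDictionary<T>
{
  private final List<T> idToValue = new ArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Appends a value under the write lock and returns its id.
  public int add(T value)
  {
    lock.writeLock().lock();
    try {
      idToValue.add(value);
      return idToValue.size() - 1;
    }
    finally {
      lock.writeLock().unlock();
    }
  }

  // Per-item lookup: one read lock/unlock pair per call.
  public T getValue(int id)
  {
    lock.readLock().lock();
    try {
      return idToValue.get(id);
    }
    finally {
      lock.readLock().unlock();
    }
  }

  // Bulk lookup: one read lock/unlock pair for the whole batch.
  // Writes the value for ids[i] into out[i]; out must be at least ids.length long.
  public void getValuesInto(int[] ids, T[] out)
  {
    lock.readLock().lock();
    try {
      for (int i = 0; i < ids.length; i++) {
        out[i] = idToValue.get(ids[i]);
      }
    }
    finally {
      lock.readLock().unlock();
    }
  }
}
```

With a row of N ids, the per-item path acquires and releases the read lock N times, while the bulk path does so once, at the cost of the caller supplying an output array.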
#### Expand benchmarks to cover wider rows and some concurrency
### Design option
There is one other design option that is faster again in most of the benchmarks
below, but is a bit less intuitive.
Here are the benchmark results across all three approaches:
```
single getValue()
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.046 ± 0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.215 ± 0.007  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  3.484 ± 0.032  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  8.638 ± 0.514  us/op

bulk getValuesInto() <----- THIS SOLUTION
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.039 ± 0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.120 ± 0.002  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  0.383 ± 0.052  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  0.386 ± 0.018  us/op

using doInsideReadLock() <----- ALTERNATIVE, FASTER BUT LESS CLEAN
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.018 ± 0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.077 ± 0.002  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  0.241 ± 0.004  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  0.486 ± 0.002  us/op
```
### Alternatives (faster!)
The alternative, `doInsideReadLock()`, is to pass a lambda / closure across the
`DimensionDictionary` boundary and have the dictionary take the lock and execute
the closure on the other side. This is likely to be faster for most cases;
however, it's a bit less clear in the API (I'll leave an extra comment).
The solution involves similar changes, and passing a closure across the
boundary. However, it does leak implementation concerns...
```
/************ DimensionDictionary.java **************/
public void doInsideReadLock(BiConsumer<List<T>, Integer> fn)
{
  lock.readLock().lock();
  try {
    fn.accept(idToValue, idForNull);
  }
  finally {
    lock.readLock().unlock();
  }
}

/************ StringDimensionIndexer.java **************/
@Override
public long estimateEncodedKeyComponentSize(int[] key)
{
  int[] estimatedSize = new int[]{key.length * Integer.BYTES};
  dimLookup.doInsideReadLock((List<String> idToValue, Integer idForNull) -> {
    for (int id : key) {
      if (id != idForNull) {
        String val = idToValue.get(id);
        int sizeOfString = 28 + 16 + (2 * val.length());
        estimatedSize[0] += sizeOfString;
      }
    }
  });
  return estimatedSize[0];
}
```
Otherwise, I think the changes are straightforward!
<hr>
This PR has:
- [x] been self-reviewed.
- [x] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [x] been tested in a test Druid cluster. (as part of other changes)