kgyrtkirk commented on code in PR #16517:
URL: https://github.com/apache/druid/pull/16517#discussion_r1622175524
##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -132,12 +142,14 @@ public EncodedKeyComponent<int[]>
processRowValsToUnsortedEncodedKeyComponent(@N
} else if (dimValues instanceof byte[]) {
encodedDimensionValues =
new
int[]{dimLookup.add(emptyToNullIfNeeded(StringUtils.encodeBase64String((byte[])
dimValues)))};
+ dictionaryChanged = true;
} else {
encodedDimensionValues = new
int[]{dimLookup.add(emptyToNullIfNeeded(dimValues))};
+ dictionaryChanged = true;
}
// If dictionary size has changed, the sorted lookup is no longer valid.
- if (oldDictSize != dimLookup.size()) {
+ if (dictionaryChanged) {
Review Comment:
only the `add` method are called on `dimLookup` above - which naturally
changes the `size` ; I'm not sure what the addition of this `boolean` adds - as
it must be set to true after all `add` calls.
Actually since the `add` call may even [bail out and do
nothing](https://github.com/apache/druid/blob/3c72ec8413fc066dbb56e9f02d6fbc6e3b28bbe1/processing/src/main/java/org/apache/druid/segment/DimensionDictionary.java#L164)
in some cases: Theoretically this may force `sortedLookup` to become `null` in
cases in which earlier it would not done that - which seems like a behaviour
change...
...and because of the above I wonder how does the introduction of this
boolean improves performance?
##########
processing/src/main/java/org/apache/druid/segment/DictionaryEncodedColumnMerger.java:
##########
@@ -322,7 +322,7 @@ public void processMergedRow(ColumnValueSelector selector)
throws IOException
if (encodedValueSerializer instanceof ColumnarMultiIntsSerializer) {
((ColumnarMultiIntsSerializer) encodedValueSerializer).addValues(row);
} else {
- int value = row.size() == 0 ? 0 : row.get(0);
+ int value = rowSize == 0 ? 0 : row.get(0);
Review Comment:
this seems good to me :)
from the attached screenshot it appears to me that mainly the
`dimSelectorWithUnsortedValues.getRow()` method is the bottleneck ; when its
called from `SortedDimentsionSelector`
as `get` also calls that method - wouldn't it be a possibly more complete
fix to either precompute ; or lazily store the computed value of
`dimSelectorWithUnsortedValues.getRow()` in the
`DictionaryEncordedColumnIndexer$SortedDimensionSelector` class? that could
cache the row built by `StringDimensionIndexer$IndexerDimensionSelector#getRow`
- which builds an `ArrayBasedIndexedInts` which has the `size` flattened in.
this could enable the same optimization to seamlessly flow to other places
as well :)
##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -83,7 +83,7 @@ public StringDimensionIndexer(
public EncodedKeyComponent<int[]>
processRowValsToUnsortedEncodedKeyComponent(@Nullable Object dimValues, boolean
reportParseExceptions)
{
final int[] encodedDimensionValues;
- final int oldDictSize = dimLookup.size();
+ volatile boolean dictionaryChanged = false;
Review Comment:
I'm a bit confused by this `volatile` - whats the reasoning behind the usage
of that?
I don't see how this local variable can be accessed from multiple threads...
##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -94,13 +94,21 @@ public EncodedKeyComponent<int[]>
processRowValsToUnsortedEncodedKeyComponent(@N
if (dimValues == null) {
final int nullId = dimLookup.getId(null);
encodedDimensionValues = nullId == DimensionDictionary.ABSENT_VALUE_ID ?
new int[]{dimLookup.add(null)} : new int[]{nullId};
+ if (nullId == DimensionDictionary.ABSENT_VALUE_ID) {
+ encodedDimensionValues = new int[]{dimLookup.add(null)};
+ dictionaryChanged = true;
+ } else {
+ encodedDimensionValues = new int[]{nullId};
+ }
Review Comment:
with the addition of these lines I think `dimLookup.add(null)` will be
called more than before
seems like the `?:` version of this if was not removed?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]