Re: [PR] Optimise indexer resource usage based on CPU profile (druid)

via GitHub Fri, 31 May 2024 03:14:01 -0700


kgyrtkirk commented on code in PR #16517:
URL: https://github.com/apache/druid/pull/16517#discussion_r1622175524



##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -132,12 +142,14 @@ public EncodedKeyComponent<int[]> 
processRowValsToUnsortedEncodedKeyComponent(@N
     } else if (dimValues instanceof byte[]) {
       encodedDimensionValues =
           new 
int[]{dimLookup.add(emptyToNullIfNeeded(StringUtils.encodeBase64String((byte[]) 
dimValues)))};
+      dictionaryChanged = true;
     } else {
       encodedDimensionValues = new 
int[]{dimLookup.add(emptyToNullIfNeeded(dimValues))};
+      dictionaryChanged = true;
     }
 
     // If dictionary size has changed, the sorted lookup is no longer valid.
-    if (oldDictSize != dimLookup.size()) {
+    if (dictionaryChanged) {

Review Comment:
   only the `add` method are called on `dimLookup` above - which naturally 
changes the `size` ; I'm not sure what the addition of this `boolean` adds - as 
it must be set to true after all `add` calls.
   
   Actually since the `add` call may even [bail out and do 
nothing](https://github.com/apache/druid/blob/3c72ec8413fc066dbb56e9f02d6fbc6e3b28bbe1/processing/src/main/java/org/apache/druid/segment/DimensionDictionary.java#L164)
 in some cases: Theoretically this may force `sortedLookup` to become `null` in 
cases in which earlier it would not done that - which seems like a behaviour 
change...
   
   ...and because of the above I wonder how does the introduction of this 
boolean improves performance?



##########
processing/src/main/java/org/apache/druid/segment/DictionaryEncodedColumnMerger.java:
##########
@@ -322,7 +322,7 @@ public void processMergedRow(ColumnValueSelector selector) 
throws IOException
     if (encodedValueSerializer instanceof ColumnarMultiIntsSerializer) {
       ((ColumnarMultiIntsSerializer) encodedValueSerializer).addValues(row);
     } else {
-      int value = row.size() == 0 ? 0 : row.get(0);
+      int value = rowSize == 0 ? 0 : row.get(0);

Review Comment:
   this seems good to me :)
   from the attached screenshot it appears to me that mainly the 
`dimSelectorWithUnsortedValues.getRow()` method is the bottleneck ; when its 
called from `SortedDimentsionSelector`
   
   
   as `get` also calls that method - wouldn't it be a possibly more complete 
fix to either precompute ; or lazily store the computed value of 
`dimSelectorWithUnsortedValues.getRow()` in the 
`DictionaryEncordedColumnIndexer$SortedDimensionSelector` class? that could 
cache the row built by `StringDimensionIndexer$IndexerDimensionSelector#getRow` 
- which builds an `ArrayBasedIndexedInts` which has the `size` flattened in.
   
   this could enable the same optimization to seamlessly flow to other places 
as well :)



##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -83,7 +83,7 @@ public StringDimensionIndexer(
   public EncodedKeyComponent<int[]> 
processRowValsToUnsortedEncodedKeyComponent(@Nullable Object dimValues, boolean 
reportParseExceptions)
   {
     final int[] encodedDimensionValues;
-    final int oldDictSize = dimLookup.size();
+    volatile boolean dictionaryChanged = false;

Review Comment:
   I'm a bit confused by this `volatile` - whats the reasoning behind the usage 
of that?
   I don't see how this local variable can be accessed from multiple threads...
   



##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -94,13 +94,21 @@ public EncodedKeyComponent<int[]> 
processRowValsToUnsortedEncodedKeyComponent(@N
     if (dimValues == null) {
       final int nullId = dimLookup.getId(null);
       encodedDimensionValues = nullId == DimensionDictionary.ABSENT_VALUE_ID ? 
new int[]{dimLookup.add(null)} : new int[]{nullId};
+      if (nullId == DimensionDictionary.ABSENT_VALUE_ID) {
+        encodedDimensionValues = new int[]{dimLookup.add(null)};
+        dictionaryChanged = true;
+      } else {
+        encodedDimensionValues = new int[]{nullId};
+      }

Review Comment:
   with the addition of these lines I think `dimLookup.add(null)` will be 
called more than before
   
   seems like the `?:` version of this if was not removed?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Optimise indexer resource usage based on CPU profile (druid)

Reply via email to