[I] String.hash seems not cached unintentionally (druid)

via GitHub Tue, 24 Oct 2023 06:06:59 -0700


takaaki7 opened a new issue, #15242:
URL: https://github.com/apache/druid/issues/15242


   
   ### Affected Version
   druid-27.0.0-rc1
   
   ### Description
   
   I am using JFR to find the performance bottleneck of a GroupBy query, and I 
found that RowBasedGrouperHelper.addToDictionary is expensive because of the 
time spent computing String.hashCode().
   That is odd, because the result of String.hashCode() should be cached inside 
the String instance.
   Do you know what causes this overhead, or a workaround to reduce it?
   
   ![Screenshot 2023-10-23 15 10 34 
(1)](https://github.com/apache/druid/assets/8406540/7cd5e30f-22ba-4a20-a4f8-329296e2a6ba)
   
   
   Query
   ```
   SELECT user_id
   FROM "event"
   WHERE  __time > '2022-07-01T00:00:00Z' AND __time < '2022-08-31T00:00:00Z'
   AND event_name = 'view'
   GROUP BY user_id
   HAVING COUNT(*) > 5000
   ```
   
   Total rows: 1200m
   user_id cardinality: 20m
   Table is partitioned by day, sharded_by user_id
   
   - Cluster size
     - broker: 2core, 20GB RAM
     - historicals: 24core, 100GB RAM (1server)
   - Configurations in use
   ```json
   {
     "useCache": false,
     "populateCache": false,
     "debug": true,
     "finalize": false,
     "forcePushDownNestedQuery": true,
     "numParallelCombineThreads": 12,
     "bufferGrouperInitialBuckets": 8024
   }
   ```
   
   - Any debugging that you have already done
   
   I debugged the remote process with a JDWP debugger and set a conditional 
breakpoint on `addToDictionary` that fires when the input string has 
`hash == 0 && hashIsZero == false` (read via reflection), but it was never hit.
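
For readers unfamiliar with the breakpoint condition above, here is a minimal sketch of the caching scheme that the `hash` and `hashIsZero` fields imply. `CachedHashText` and `computeCount` are hypothetical names used only for illustration; this is not the JDK source and not Druid code. It only shows why `hash == 0 && hashIsZero == false` should identify a string whose hash has not been computed yet.

```java
// Hypothetical stand-in for a String-like class with a cached hash.
// Not JDK or Druid code; field names mirror those mentioned in the issue.
final class CachedHashText {
    private final char[] value;
    private int hash;            // cached hash; 0 means "not cached" or "really zero"
    private boolean hashIsZero;  // set when the computed hash is genuinely 0
    int computeCount;            // illustration only: number of full scans performed

    CachedHashText(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && !hashIsZero) {
            // Slow path: scan every character once.
            computeCount++;
            for (char c : value) {
                h = 31 * h + c;
            }
            if (h == 0) {
                hashIsZero = true;
            } else {
                hash = h;
            }
        }
        return h;
    }

    // Mirrors the negation of the breakpoint condition used in the issue.
    boolean isHashCached() {
        return hash != 0 || hashIsZero;
    }
}

final class CachedHashTextDemo {
    public static void main(String[] args) {
        CachedHashText t = new CachedHashText("user_42");
        t.hashCode();
        t.hashCode();
        System.out.println("full scans: " + t.computeCount);
    }
}
```

Under this scheme, a second call to `hashCode()` on the same instance skips the character scan, which is why the cost observed in the profile is surprising if the same instances are reused.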
   
   ---
   I have read the code but cannot find the cause. (There appears to be no 
String instance re-creation and nothing that bypasses the cached hash.)
   
   Related function calls:
   The RowBasedGrouperHelper accumulator calls grouper.aggregate(new 
RowBasedKey(key)).
   
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java#L342
   
   Grouper computes the key's hash code.
   
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/Grouper.java#L82
   
   Then RowBasedKey.hashCode() is called. Arrays.hashCode(key) calls each 
element's hashCode() internally, so each String instance's hash code should be 
cached at this point.
   
https://github.com/takaaki7/druid/blob/9d92a663f8e3964cf23d93259f52d6fb9137d5b9/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java#L708
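
The delegation described above can be checked in isolation with a small sketch. `SketchRowKey` and `CountingElement` are hypothetical classes written for this illustration; they are not Druid's `RowBasedKey`. The sketch assumes only that the key wraps an `Object[]` and hashes it with `Arrays.hashCode`, as the linked line suggests.

```java
import java.util.Arrays;

// Hypothetical element that counts how often its hashCode() is invoked.
final class CountingElement {
    final String text;
    int hashCalls;

    CountingElement(String text) {
        this.text = text;
    }

    @Override
    public int hashCode() {
        hashCalls++;
        return text.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CountingElement && text.equals(((CountingElement) o).text);
    }
}

// Hypothetical key wrapper; assumes the key hashes its array via Arrays.hashCode.
final class SketchRowKey {
    private final Object[] key;

    SketchRowKey(Object... key) {
        this.key = key;
    }

    @Override
    public int hashCode() {
        // Arrays.hashCode calls hashCode() on every non-null element.
        return Arrays.hashCode(key);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SketchRowKey && Arrays.equals(key, ((SketchRowKey) o).key);
    }
}

final class SketchRowKeyDemo {
    public static void main(String[] args) {
        CountingElement userId = new CountingElement("u1");
        SketchRowKey rowKey = new SketchRowKey(userId, "view");
        rowKey.hashCode();
        System.out.println("element hashCode() calls: " + userId.hashCalls);
    }
}
```

Because hashing the key touches every element, any String element that is the same instance later passed to `addToDictionary` would already have had `hashCode()` called on it once.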
   
   The key instance is then passed to 
DynamicDictionaryStringRowBasedKeySerdeHelper.addToDictionary without being 
copied. (ConcurrentGrouper.aggregate() -> SpillingGrouper.aggregate() -> 
AbstractBufferHashGrouper.aggregate() -> RowBasedKeySerde.toByteBuffer() -> 
DynamicDictionaryStringRowBasedKeySerdeHelper.addToDictionary())
   
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java#L1713


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
