takaaki7 opened a new issue, #15242: URL: https://github.com/apache/druid/issues/15242
### Affected Version

druid-27.0.0-rc1

### Description

I am using JFR to find the performance bottleneck of a GroupBy query, and I found that `RowBasedGrouperHelper.addToDictionary` is expensive because of the time spent computing `String.hashCode()`. This is odd, because the result of `String.hashCode()` should be cached inside the `String` instance. Do you know what causes this overhead, or a workaround to reduce it?

Query

```
SELECT user_id
FROM "event"
WHERE __time > '2022-07-01T00:00:00Z'
  AND __time < '2022-08-31T00:00:00Z'
  AND event_name = 'view'
GROUP BY user_id
HAVING COUNT(*) > 5000
```

Total rows: 1200m
user_id cardinality: 20m
Table is partitioned by day, sharded_by user_id

- Cluster size
  - broker: 2core, 20GB RAM
  - historicals: 24core, 100GB RAM (1server)
- Configurations in use

```json
{
  "useCache": false,
  "populateCache": false,
  "debug": true,
  "finalize": false,
  "forcePushDownNestedQuery": true,
  "numParallelCombineThreads": 12,
  "bufferGrouperInitialBuckets": 8024
}
```

- Any debugging that you have already done

I debugged the remote process with a JDWP debugger and set a conditional breakpoint on `addToDictionary` that fires when the input string has `hash == 0 && hashIsZero == false` (read via reflection), but the breakpoint never caught such a case.

---

I have read the code but cannot find the cause. (There appears to be no re-creation of the string instance and no path that bypasses the cached hash.)

Related function calls:

The `RowBasedGrouperHelper` accumulator calls `grouper.aggregate(new RowBasedKey(key))`:
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java#L342

The Grouper then computes the hash code:
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/Grouper.java#L82

Then `RowBasedKey.hashCode()` is called. `Arrays.hashCode(key)` calls each element's `hashCode()` internally, so the string instance's hash code should already be cached at this point.
https://github.com/takaaki7/druid/blob/9d92a663f8e3964cf23d93259f52d6fb9137d5b9/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java#L708

The key instance is then passed to `DynamicDictionaryStringRowBasedKeySerdeHelper.addToDictionary` without any instance copy.
(`ConcurrentGrouper.aggregate()` -> `SpillingGrouper.aggregate()` -> `AbstractBufferHashGrouper.aggregate()` -> `RowBasedKeySerde.toByteBuffer()` -> `DynamicDictionaryStringRowBasedKeySerdeHelper.addToDictionary()`)
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java#L1713
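
The call pattern described above can be illustrated with a minimal, self-contained Java sketch. It is not Druid code: `CountingValue`, `Key`, and the `addToDictionary` helper here are hypothetical stand-ins, and a plain `HashMap` is assumed in place of whatever map the real dictionary uses. The sketch only shows that `Arrays.hashCode(key)` invokes each element's `hashCode()`, and that a later dictionary lookup on the same instance invokes `hashCode()` again. For a real `String` those later calls would be expected to read the cached value, so this illustrates the reasoning rather than explaining the JFR observation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class HashCallSketch {

    /** Hypothetical stand-in for a String dimension value; counts hashCode() calls. */
    static final class CountingValue {
        final String value;
        int hashCalls = 0;

        CountingValue(String value) {
            this.value = value;
        }

        @Override
        public int hashCode() {
            hashCalls++;
            return value.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CountingValue && ((CountingValue) o).value.equals(value);
        }
    }

    /** Simplified stand-in for RowBasedKey: the hash is Arrays.hashCode over the elements. */
    static final class Key {
        final Object[] key;

        Key(Object... key) {
            this.key = key;
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(key);
        }
    }

    /** Simplified dictionary insert: look the value up and assign the next id if it is absent. */
    static int addToDictionary(Map<Object, Integer> dictionary, Object value) {
        Integer id = dictionary.get(value);
        if (id == null) {
            id = dictionary.size();
            dictionary.put(value, id);
        }
        return id;
    }

    public static void main(String[] args) {
        CountingValue userId = new CountingValue("user-1");
        Key key = new Key(userId);
        Map<Object, Integer> dictionary = new HashMap<>();

        // Hashing the key hashes each element once.
        key.hashCode();
        System.out.println("hashCode() calls after key.hashCode(): " + userId.hashCalls);

        // The same element instance (no copy) is then hashed again by the dictionary.
        addToDictionary(dictionary, key.key[0]);
        System.out.println("hashCode() calls after first insert: " + userId.hashCalls);

        addToDictionary(dictionary, key.key[0]);
        System.out.println("hashCode() calls after second insert: " + userId.hashCalls);
    }
}
```

Running `main` prints the running count of `hashCode()` invocations on the element, which makes it visible that the dictionary step still enters `hashCode()` on every lookup even when the instance is unchanged.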
