gianm commented on issue #17139: URL: https://github.com/apache/druid/issues/17139#issuecomment-2374395389
> The controller can consume up to a maximum of 300MB of heap space for the collection of statistics, and this maximum seems a lot given that it will be per query, if compared with the default value of something like `maxSubqueryBytes`.
>
> For reference, a broker with 20 GB of heap space, and 50 max concurrent queries and no lookups would allocate 0.5 * 1/50 * (20GB) of heap space per query for inlining results which is 200MB, while each Dart query can theoretically take up to 300MBs. This parity will be more for brokers with smaller heap sizes. Do we require some limiting for the brokers at the moment (like we do with subqueries) or would we take this up once we start tuning concurrency?

The heap space used by partition statistics is capped to 15% of the overall Broker heap, and is split across all possible controllers based on the maximum Dart concurrency. So a Broker with 20 GB of heap and 50 max concurrent Dart queries would use at most 60 MB for partition statistics per controller. Check out `DartControllerMemoryManagementModule` for where this is set up.

Another thing is that I expect we won't be gathering partition statistics for most queries for very long. When these two future-work items are complete, then `globalSort` (and the statistics gathering) will only be needed when a query has an `ORDER BY` without a `LIMIT` at the outer level. Anything else would be able to use hash partitioning with local sorting, skipping stats gathering.

> - Multithread `hashLocalSort` shuffles. Currently only one partition is sorted at a time, even on a multithreaded worker. This is the main reason the initial version is using `globalSort` so much, even though `globalSort` involves more overhead on the controller.
> - Use `hashLocalSort` for aggregation rather than `globalSort`, once it's multithreaded, to reduce dependency on the controller and on statistics gathering.
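To make the two figures in this thread easy to re-derive for other heap sizes, here is a minimal sketch of the arithmetic. It is not Druid code: the function name `per_query_share_mb` is hypothetical, the 15% and 50% fractions are the ones quoted in the comment and the question above, and it assumes decimal units (1 GB = 1000 MB) so that the results match the numbers in the text.

```python
def per_query_share_mb(heap_mb: int, heap_percent: int, max_concurrent_queries: int) -> int:
    """Split a fixed percentage of the heap evenly across concurrent queries.

    Hypothetical helper for illustration only; integer arithmetic keeps the
    results exact. Assumes the pool is divided evenly, as described above.
    """
    if max_concurrent_queries <= 0:
        raise ValueError("max_concurrent_queries must be positive")
    pool_mb = heap_mb * heap_percent // 100   # total pool carved out of the heap
    return pool_mb // max_concurrent_queries  # even share per query/controller


HEAP_MB = 20_000        # 20 GB Broker heap, decimal units (assumption)
MAX_CONCURRENCY = 50    # max concurrent queries from the example

# Partition statistics: 15% of heap, split across all possible controllers.
stats_limit_mb = per_query_share_mb(HEAP_MB, 15, MAX_CONCURRENCY)

# Subquery inlining figure from the quoted question: 0.5 * 1/50 * heap.
subquery_limit_mb = per_query_share_mb(HEAP_MB, 50, MAX_CONCURRENCY)

print(f"partition statistics per controller: {stats_limit_mb} MB")
print(f"subquery inlining per query:         {subquery_limit_mb} MB")
```

With the 20 GB / 50-query example this gives 60 MB per controller for partition statistics and 200 MB per query for subquery inlining, matching the numbers above.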
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
