GWphua opened a new pull request, #18731: URL: https://github.com/apache/druid/pull/18731
Fixes #17902

### Description

#### Tracking merge buffer usage

- Direct byte buffer usage happens in `AbstractBufferHashGrouper` and its implementations.
  1. Each direct byte buffer is managed through a `ByteBufferHashTable` together with an offset tracker.
  2. Usage is calculated by tracking the maximum capacity of the byte buffer in `ByteBufferHashTable` and the maximum offset size observed over the query's lifecycle.

The calculation takes the maximum seen at any point during the query, so operators can better judge how large the merge buffers need to be configured. A rough sketch of the idea is shown below.
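Conceptually, the tracking reduces to keeping a running maximum and folding it into a per-query accumulator. The sketch below is illustrative only; the class and method names (`MergeBufferUsageTracker`, `updateUsage`, `getMaxUsedBytes`) are hypothetical and are not the actual `GroupByStatsProvider` API introduced by this PR.

```java
/**
 * Illustrative sketch only: keeps the peak merge buffer usage observed while a
 * group-by query runs. Names are hypothetical, not the real GroupByStatsProvider API.
 */
public class MergeBufferUsageTracker
{
  // Highest hash table capacity (bytes) seen so far for this query.
  private long maxTableCapacityBytes = 0;
  // Highest offset-list size (bytes) seen so far for this query.
  private long maxOffsetBytes = 0;

  /** Called whenever the grouper grows or writes into its buffer. */
  public synchronized void updateUsage(long tableCapacityBytes, long offsetBytes)
  {
    maxTableCapacityBytes = Math.max(maxTableCapacityBytes, tableCapacityBytes);
    maxOffsetBytes = Math.max(maxOffsetBytes, offsetBytes);
  }

  /** Peak combined usage, reported once when the query finishes. */
  public synchronized long getMaxUsedBytes()
  {
    return maxTableCapacityBytes + maxOffsetBytes;
  }
}
```

Reporting the peak rather than a point-in-time value means the emitted number reflects the largest footprint the query ever needed, which is what matters when sizing `druid.processing.buffer.sizeBytes`.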
#### Release note

<hr>

##### Key changed/added classes in this PR

* `GroupByStatsProvider`

<hr>

This PR has:

- [x] been self-reviewed.
- [x] a release note entry in the PR description.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [x] been tested in a test Druid cluster.

### Future Plans

While building this PR, I came across some further enhancements that could be introduced in the future:

#### Adding per-query stats to query logs

Adding per-query information to the query logs makes sense to me.

#### Per-buffer metrics

The current metric is useful, but it will not report accurately for nested group-bys. As far as I know, a nested group-by limits merge buffer usage to two buffers, meaning a merge buffer will be re-used, and a per-query metric will likely over-report merge buffer usage. It would be better to report per-buffer usage instead of per-query usage.

#### Simplify memory management

Right now we need to configure the following for each queryable service:

1. size of each merge buffer
2. number of merge buffers
3. direct memory = (numProcessingThreads + numMergeBuffers + 1) * mergeBufferSizeBytes

It would be great if we could simplify this down to configuring direct memory alone and managing a memory pool instead. That would allow for more flexibility: unused memory allocated for merge buffers could be used by processing threads instead. A worked example of the item 3 calculation follows below.
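For reference, item 3 above is the direct-memory rule from the Druid configuration docs: `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes`. The sketch below only works through the arithmetic for one illustrative configuration; the values are examples, not recommended defaults.

```java
/**
 * Worked example of the current per-service direct-memory sizing rule.
 * The values are illustrative, not Druid defaults.
 */
public class DirectMemoryEstimate
{
  public static void main(String[] args)
  {
    long mergeBufferSizeBytes = 512L * 1024 * 1024; // druid.processing.buffer.sizeBytes
    int numMergeBuffers = 4;                        // druid.processing.numMergeBuffers
    int numProcessingThreads = 8;                   // druid.processing.numThreads

    // Direct memory that must be reserved via -XX:MaxDirectMemorySize.
    long directMemoryBytes = (numProcessingThreads + numMergeBuffers + 1) * mergeBufferSizeBytes;

    // (8 + 4 + 1) * 512 MiB = 6656 MiB = 6.5 GiB
    System.out.printf("Required direct memory: %d bytes (%.1f GiB)%n",
                      directMemoryBytes, directMemoryBytes / (1024.0 * 1024 * 1024));
  }
}
```

Under the pooled model proposed here, only the total direct memory would need to be configured, and the split between processing threads and merge buffers would be managed internally.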
