[GitHub] [arrow-datafusion] mingmwang commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

via GitHub Mon, 26 Jun 2023 21:56:07 -0700


mingmwang commented on issue #4973:
URL: 
https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1608796360

@alamb @tustvold

If I understand correctly, in this new proposal, we will not use the `Row
Layout` to manage the `Accumulator` states, right?
I think the `Row Layout` is still useful in the case of high cardinality
grouping along with multiple `Accumulators` (avg, min, max,..). With the `Row
Layout`, the states of multiple `Accumulators` belong to the same group can be
updated in the tight loop. If we let each `Accumulator` manages its own state,
I guess there will be more random memory access.

And below is my observation:

1). The per-group type dispatching overhead is not the major bottleneck,
that's also why the PR #6657 only get 10%~ 15%
performance gain.
2). The random memory access and mem stall should be the major bottleneck
in the case of high cardinality grouping .

And I checked DuckDB's source code, looks like they also use the `Row
Layout` to manage the aggregator states.
They use C++ templates in the state update functions also. I'm going to take
a closer look at the source code this week and next week.

https://github.com/duckdb/duckdb/blob/e49cf8132dc07324e0bb165593c8160837fcef3a/src/include/duckdb/common/types/row/row_layout.hpp

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] mingmwang commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

Reply via email to