Re: [PR] refactor(hash-aggr): Simplify aggregate hash table [datafusion]

via GitHub Fri, 03 Jul 2026 22:03:02 -0700


2010YOUY01 commented on code in PR #23309:
URL: https://github.com/apache/datafusion/pull/23309#discussion_r3522601758



##########
datafusion/physical-plan/src/aggregates/aggregate_hash_table/common.rs:
##########
@@ -60,19 +72,36 @@ pub(in crate::aggregates) struct FinalMarker;
 /// [`GroupsAccumulator`]. Both use columnar storage so aggregation can stay
 /// vectorized.
 ///
-/// # Marker Type
-/// `AggrMode` selects the aggregate semantics.
+/// # Mode
+///
+/// [`AggregateTableMode`] controls how input batches update accumulator state
+/// and how output batches are materialized.
 ///
-/// e.g. `AggregateHashTable::<PartialMarker>::new(...)` creates an aggregate 
hash table
-/// for the partial hash aggregate stage, the input schema is raw rows and 
output
-/// schema is intermediate states.
+/// ```text
+/// Example: `AVG(x) GROUP BY k`.
 ///
-/// It is a zero-sized compile-time marker, so each stage keeps its update 
logic
-/// in a separate impl block, to make the behavior difference explicit.
-pub(in crate::aggregates) struct AggregateHashTable<AggrMode> {
+/// In `Partial` mode, the table stores partial state:
+///     k, sum(x), count(x)
+/// The input batch contains raw values:
+///     k, x
+/// The output batch also contains partial state:
+///     k, sum(x), count(x)
+/// ```
+///
+/// So input uses [`GroupsAccumulator::update_batch`], and output uses
+/// [`GroupsAccumulator::state`].
+///
+/// Other modes use different input/output combinations:
+/// - `Final`: merge_batch + evaluate
+/// - `PartialReduce`: merge_batch + state
+/// - `Single`: update_batch + evaluate
+pub(in crate::aggregates) struct AggregateHashTable {

Review Comment:
   ### For Proposed alternative
   
   I have considered this approach: separating aggregate hash tables by mode 
and extracting shared internal utilities.
   
   My concern is that this would create a [shallow 
module](https://softengbook.org/articles/deep-modules). To fully understand the 
implementation, we would still need to reason about the same underlying mode 
differences, while the external `AggregateHashTable` variants would add 
complexity with almost no functional contribution.
   
   So I think keeping the mode as an internal flag is simpler overall.
   
   ### Complexity concern
   
   For the concern about growing internal complexity, see the rationale in the 
issue. I think this should remain manageable in the long term.
   
   ### Partial skip table
   
   Yes, this is a valid concern. The partial skip table logic could be moved 
out of this module entirely to make the structure cleaner. However, I think 
this can be delayed a bit, and I plan to do it as a follow-up.
   https://github.com/apache/datafusion/issues/23113



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] refactor(hash-aggr): Simplify aggregate hash table [datafusion]

Reply via email to