Rachelint commented on code in PR #22729: URL: https://github.com/apache/datafusion/pull/22729#discussion_r3359869455
########## datafusion/physical-plan/src/aggregates/hash_table.rs: ########## @@ -0,0 +1,615 @@ +// Licensed to the Apache Software Foundation (ASF) under one Review Comment: Will clearer to organize it like instead of single `hash_table.rs`? ``` aggregator - mod.rs - partial.rs - final.rs ``` ########## datafusion/physical-plan/src/aggregates/hash_table.rs: ########## @@ -0,0 +1,615 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::collections::HashMap; +use std::marker::PhantomData; +use std::sync::Arc; + +use arrow::array::{ArrayRef, AsArray, BooleanArray, new_null_array}; +use arrow::datatypes::SchemaRef; +use arrow::record_batch::RecordBatch; +use datafusion_common::{Result, internal_err}; +use datafusion_execution::memory_pool::proxy::VecAllocExt; +use datafusion_expr::{EmitTo, GroupsAccumulator}; + +use super::group_values::{GroupByMetrics, GroupValues, new_group_values}; +use super::order::GroupOrdering; +use super::row_hash::create_group_accumulator; +use super::{ + AggregateExec, PhysicalGroupBy, aggregate_expressions, evaluate_group_by, + group_id_array, max_duplicate_ordinal, +}; +use crate::PhysicalExpr; +use crate::metrics::{MetricBuilder, MetricCategory}; + +/// Marker for raw rows -> partial state aggregation. +pub(super) struct Partial; +/// Marker for partial state -> final value aggregation. +pub(super) struct Final; + +/// Grouped hash table shared by the initial-partial and partial-final paths. +/// +/// While building, it consumes input batches and updates group / accumulator +/// state. While outputting, it incrementally output the materialized batches. +/// +/// # Marker Type +/// `AggrMode` is a zero-sized marker type used with `PhantomData` to keep the +/// initial-partial and partial-final update logic in separate impl blocks. Different +/// stages has different semantics and applicable optimizations, this makes the +/// implementation easier to follow. +pub(super) struct AggregateHashTable<AggrMode> { Review Comment: Seems `AggrMode` is not actually used. I think it may be clearer like that? - define `AggregatorCore`, and place the common methods in it - define `PartialAggregator` and `FinalAggregator` to wrap `AggregatorCore`, and place the specific methods in them ########## datafusion/sqllogictest/test_files/information_schema.slt: ########## @@ -370,6 +371,7 @@ datafusion.execution.batch_size 8192 Default batch size while creating new batch datafusion.execution.coalesce_batches true When set to true, record batches will be examined between each operator and small batches will be coalesced into larger batches. This is helpful when there are highly selective filters or joins that could produce tiny output batches. The target batch size is determined by the configuration setting datafusion.execution.collect_statistics true Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default `ListingTableProvider` in DataFusion. Defaults to true. datafusion.execution.enable_ansi_mode false Whether to enable ANSI SQL mode. The flag is experimental and relevant only for DataFusion Spark built-in functions When `enable_ansi_mode` is set to `true`, the query engine follows ANSI SQL semantics for expressions, casting, and error handling. This means: - **Strict type coercion rules:** implicit casts between incompatible types are disallowed. - **Standard SQL arithmetic behavior:** operations such as division by zero, numeric overflow, or invalid casts raise runtime errors rather than returning `NULL` or adjusted values. - **Consistent ANSI behavior** for string concatenation, comparisons, and `NULL` handling. When `enable_ansi_mode` is `false` (the default), the engine uses a more permissive, non-ANSI mode designed for user convenience and backward compatibility. In this mode: - Implicit casts between types are allowed (e.g., string to integer when possible). - Arithmetic operations are more lenient — for example, `abs()` on the minimum representable integer value returns the input value instead of raising overflow. - Division by zero or invalid casts may return `NULL` instead of failing. # Default `false` — ANSI SQL mode is disabled by default. +datafusion.execution.enable_migration_aggregate true Temporary switch for aggregate stream implementations that are being migrated from `GroupedHashAggregateStream`. When set to true, DataFusion tries the migrated implementations when their preconditions are satisfied. When set to false, grouped aggregation falls back to `GroupedHashAggregateStream`. This option will be removed after the migration is finished. See <https://github.com/apache/datafusion/issues/22710> for details. Review Comment: Seems we can also add fuzzy case for it in `/Users/ruiqiucao/Desktop/github/datafusion/datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs` based on this config option. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
