This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new 34efd1fbae More comment to aggregation fuzzer (#15048)
34efd1fbae is described below
commit 34efd1fbae39eb0441a43ab976fc23001d1f674a
Author: Yongting You <[email protected]>
AuthorDate: Fri Mar 7 00:24:22 2025 +0800
More comment to aggregation fuzzer (#15048)
---
.../aggregation_fuzzer/data_generator.rs | 23 +++++++++++++++++++++-
.../tests/fuzz_cases/aggregation_fuzzer/mod.rs | 20 +++++++++++++++++++
2 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs
index 4d4c6aa793..54c5744c86 100644
--- a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs
+++ b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs
@@ -100,7 +100,28 @@ impl DatasetGeneratorConfig {
/// Dataset generator
///
-/// It will generate one random [`Dataset`] when `generate` function is called.
+/// It will generate random [`Dataset`]s when the `generate` function is called. For each
+/// sort key in `sort_keys_set`, an additional sorted dataset will be generated, and the
+/// dataset will be chunked into staggered batches.
+///
+/// # Example
+/// For `DatasetGenerator` with `sort_keys_set = [["a"], ["b"]]`, it will generate 2
+/// datasets. The first one will be sorted by column `a` and get randomly chunked
+/// into staggered batches. It might look like the following:
+/// ```text
+/// a b
+/// ----
+/// 1 2 <-- batch 1
+/// 1 1
+///
+/// 2 1 <-- batch 2
+///
+/// 3 3 <-- batch 3
+/// 4 3
+/// 4 1
+/// ```
+///
+/// # Implementation details:
///
/// The generation logic in `generate`:
///
diff --git a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs
index 7c5b25e4a0..1e42ac1f4b 100644
--- a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs
+++ b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs
@@ -15,6 +15,26 @@
// specific language governing permissions and limitations
// under the License.
+//! Fuzzer for aggregation functions
+//!
+//! The main idea behind aggregate fuzzing is: for aggregation, DataFusion has many
+//! specialized implementations for performance. For example, when the group cardinality
+//! is high, DataFusion will skip the first stage of two-stage hash aggregation; when
+//! the input is ordered by the group key, there is a separate implementation to perform
+//! streaming group by.
+//! This fuzzer checks the results of different specialized implementations and
+//! ensures their results are consistent. The execution path can be controlled by
+//! changing the input ordering or by setting related configuration parameters in
+//! `SessionContext`.
+//!
+//! # Architecture
+//! - `aggregate_fuzz.rs` includes the entry point for fuzzer runs.
+//! - `QueryBuilder` is used to generate candidate queries.
+//! - `DatasetGenerator` is used to generate random datasets.
+//! - `SessionContextGenerator` is used to generate `SessionContext` with
+//! different configuration parameters to control the execution path of aggregate
+//! queries.
+
use arrow::array::RecordBatch;
use arrow::util::pretty::pretty_format_batches;
use datafusion::prelude::SessionContext;
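
Illustration (not part of the commit): the `DatasetGenerator` doc comment in the diff above describes sorting a dataset by a key and then chunking it into staggered batches. A minimal standalone sketch of that idea follows. It assumes plain `(a, b)` tuples instead of Arrow `RecordBatch`es, and the names `Row`, `Lcg`, `sort_by_a`, and `stagger` are hypothetical helpers invented here; the real generator in DataFusion may work differently.

```rust
/// Hypothetical row type for illustration: an `(a, b)` pair, mirroring the
/// two-column example in the doc comment. The real fuzzer uses Arrow batches.
type Row = (i32, i32);

/// Tiny deterministic pseudo-random generator (a linear congruential
/// generator), used only so this sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a value in `1..=upper`. `upper` must be greater than zero.
    fn next_in(&mut self, upper: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize % upper) + 1
    }
}

/// Sorts rows by column `a`. The sort is stable, so rows with equal `a`
/// keep their input order.
fn sort_by_a(mut rows: Vec<Row>) -> Vec<Row> {
    rows.sort_by_key(|r| r.0);
    rows
}

/// Splits `rows` into consecutive batches of pseudo-random length between
/// 1 and `max_batch`, preserving row order across batch boundaries.
fn stagger(rows: &[Row], max_batch: usize, seed: u64) -> Vec<Vec<Row>> {
    assert!(max_batch > 0, "max_batch must be at least 1");
    let mut rng = Lcg(seed);
    let mut batches = Vec::new();
    let mut start = 0;
    while start < rows.len() {
        // Never read past the end of the input on the last batch.
        let len = rng.next_in(max_batch).min(rows.len() - start);
        batches.push(rows[start..start + len].to_vec());
        start += len;
    }
    batches
}
```

Calling `stagger(&sort_by_a(rows), 3, seed)` yields non-empty batches whose concatenation equals the sorted input. That is the property that lets a streaming group-by see the same ordered data under different batch boundaries.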