(datafusion) branch main updated: More comment to aggregation fuzzer (#15048)

alamb Thu, 06 Mar 2025 08:25:55 -0800

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git



The following commit(s) were added to refs/heads/main by this push:
     new 34efd1fbae More comment to aggregation fuzzer (#15048)
34efd1fbae is described below

commit 34efd1fbae39eb0441a43ab976fc23001d1f674a
Author: Yongting You <[email protected]>
AuthorDate: Fri Mar 7 00:24:22 2025 +0800

    More comment to aggregation fuzzer (#15048)
---
 .../aggregation_fuzzer/data_generator.rs           | 23 +++++++++++++++++++++-
 .../tests/fuzz_cases/aggregation_fuzzer/mod.rs     | 20 +++++++++++++++++++
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs
index 4d4c6aa793..54c5744c86 100644
--- a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs
+++ b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs
@@ -100,7 +100,28 @@ impl DatasetGeneratorConfig {
 
 /// Dataset generator
 ///
-/// It will generate one random [`Dataset`] when `generate` function is called.
+/// It will generate random [`Dataset`]s when the `generate` function is called. For each
+/// sort key in `sort_keys_set`, an additional sorted dataset will be generated, and the
+/// dataset will be chunked into staggered batches.
+///
+/// # Example
+/// For `DatasetGenerator` with `sort_keys_set = [["a"], ["b"]]`, it will generate 2
+/// datasets. The first one will be sorted by column `a` and get randomly chunked
+/// into staggered batches. It might look like the following:
+/// ```text
+/// a b
+/// ----
+/// 1 2 <-- batch 1
+/// 1 1
+///
+/// 2 1 <-- batch 2
+///
+/// 3 3 <-- batch 3
+/// 4 3
+/// 4 1
+/// ```
+///
+/// # Implementation details:
 ///
 /// The generation logic in `generate`:
 ///
diff --git a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs
index 7c5b25e4a0..1e42ac1f4b 100644
--- a/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs
+++ b/datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs
@@ -15,6 +15,26 @@
 // specific language governing permissions and limitations
 // under the License.
 
+//! Fuzzer for aggregation functions
+//!
+//! The main idea behind aggregate fuzzing is: for aggregation, DataFusion has many
+//! specialized implementations for performance. For example, when the group cardinality
+//! is high, DataFusion will skip the first stage of two-stage hash aggregation; when
+//! the input is ordered by the group key, there is a separate implementation to perform
+//! streaming group by.
+//! This fuzzer checks the results of different specialized implementations and
+//! ensures their results are consistent. The execution path can be controlled by
+//! changing the input ordering or by setting related configuration parameters in
+//! `SessionContext`.
+//!
+//! # Architecture
+//! - `aggregate_fuzz.rs` includes the entry point for fuzzer runs.
+//! - `QueryBuilder` is used to generate candidate queries.
+//! - `DatasetGenerator` is used to generate random datasets.
+//! - `SessionContextGenerator` is used to generate `SessionContext` with
+//!   different configuration parameters to control the execution path of aggregate
+//!   queries.
+
 use arrow::array::RecordBatch;
 use arrow::util::pretty::pretty_format_batches;
 use datafusion::prelude::SessionContext;
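
The "sorted, then randomly chunked into staggered batches" behavior described in the `DatasetGenerator` comment can be illustrated with a small standalone sketch. This is not the code from `data_generator.rs`: the function `sorted_staggered_batches`, the `Lcg` helper, and the row type `(i32, i32)` standing for columns `a` and `b` are all hypothetical names chosen for illustration, and a tiny hand-rolled generator stands in for whatever random source the fuzzer really uses.

```rust
/// Hypothetical stand-in for a random source, so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random value in `1..=upper`. `upper` must be at least 1.
    fn next_in_range(&mut self, upper: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize % upper) + 1
    }
}

/// Sort rows by column `a` (the first tuple field), then cut the sorted rows
/// into consecutive batches whose sizes are drawn at random from `1..=max_batch_size`.
fn sorted_staggered_batches(
    mut rows: Vec<(i32, i32)>,
    max_batch_size: usize,
    seed: u64,
) -> Vec<Vec<(i32, i32)>> {
    rows.sort_by_key(|&(a, _)| a);
    let mut rng = Lcg(seed);
    let mut batches = Vec::new();
    let mut start = 0;
    while start < rows.len() {
        // Never read past the end of the data; each batch holds at least one row.
        let size = rng.next_in_range(max_batch_size).min(rows.len() - start);
        batches.push(rows[start..start + size].to_vec());
        start += size;
    }
    batches
}

fn main() {
    let rows = vec![(4, 3), (1, 2), (3, 3), (2, 1), (4, 1), (1, 1)];
    for (i, batch) in sorted_staggered_batches(rows, 3, 42).iter().enumerate() {
        println!("batch {}: {:?}", i + 1, batch);
    }
}
```

Concatenating the batches in order gives back the rows sorted by `a`; only the batch boundaries vary with the seed, which is the property the sorted datasets are meant to exercise.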

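The consistency check described in the module comment of `mod.rs` can be sketched in miniature as a differential test: run the same aggregation through two different implementations and require equal results. The sketch below assumes a `GROUP BY key, SUM(value)` query over `(i32, i64)` rows; `hash_group_sum` and `streaming_group_sum` are hypothetical toy functions, not DataFusion APIs, and they only mimic the hash-based and ordered-input paths the comment mentions.

```rust
use std::collections::HashMap;

/// Hash-based GROUP BY key, SUM(value). Works for any input order.
fn hash_group_sum(rows: &[(i32, i64)]) -> Vec<(i32, i64)> {
    let mut acc: HashMap<i32, i64> = HashMap::new();
    for &(k, v) in rows {
        *acc.entry(k).or_insert(0) += v;
    }
    let mut out: Vec<(i32, i64)> = acc.into_iter().collect();
    // Sort the output so it can be compared with the streaming result.
    out.sort();
    out
}

/// Streaming GROUP BY. Assumes the input is already sorted by key, so a
/// group is finished as soon as the key changes.
fn streaming_group_sum(sorted_rows: &[(i32, i64)]) -> Vec<(i32, i64)> {
    let mut out: Vec<(i32, i64)> = Vec::new();
    for &(k, v) in sorted_rows {
        if let Some(last) = out.last_mut() {
            if last.0 == k {
                last.1 += v;
                continue;
            }
        }
        out.push((k, v));
    }
    out
}

fn main() {
    let mut rows: Vec<(i32, i64)> = vec![(4, 3), (1, 2), (3, 3), (2, 1), (4, 1), (1, 1)];
    let expected = hash_group_sum(&rows);

    // Changing the input ordering is what makes the streaming path applicable.
    rows.sort_by_key(|&(k, _)| k);
    let actual = streaming_group_sum(&rows);

    assert_eq!(expected, actual, "execution paths disagree");
    println!("{:?}", actual);
}
```

In the real fuzzer the two paths are selected inside DataFusion by input ordering and `SessionContext` configuration rather than by calling different functions, but the comparison at the end follows the same idea.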
