zhuqi-lucas commented on code in PR #22751:
URL: https://github.com/apache/datafusion/pull/22751#discussion_r3354069421
##########
datafusion/physical-plan/src/lib.rs:
##########
@@ -29,6 +29,15 @@
//!
//! Entrypoint of this crate is trait [ExecutionPlan].
+// Wire dhat in as the global allocator only when the optional `dhat-heap`
+// feature is enabled. This lets the `aggregates::group_values::multi_group_by`
+// memory regression tests measure real heap usage with `dhat::Profiler`.
+// The default test run does not pay any profiler overhead because the
+// dependency is gated behind the feature.
+#[cfg(feature = "dhat-heap")]
+#[global_allocator]
+static ALLOC: dhat::Alloc = dhat::Alloc;
Review Comment:
Good catch, fixed in the amended push. The global allocator now sits behind
`cfg(all(test, feature = "dhat-heap"))` so downstream crates that transitively
enable the feature in their own bench/test builds will not pick up a
conflicting allocator in their non-test builds.
##########
datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs:
##########
@@ -1820,3 +1912,65 @@ mod tests {
);
}
}
+
+/// Heap regression tests for `GroupValuesColumn`. Gated behind the
+/// `dhat-heap` feature so the default `cargo test` does not pay the
+/// profiler overhead.
+///
+/// Run with:
+/// ```bash
+/// cargo test -p datafusion-physical-plan --features dhat-heap \
+/// aggregates::group_values::multi_group_by::dhat_tests
+/// ```
+///
+/// Each test wraps a representative `GroupValuesColumn::intern` workload
+/// in a `dhat::Profiler` and asserts that `dhat::HeapStats::get().max_bytes`
+/// stays under a fixed budget. The budget is generous in this PR (which
+/// only refactors the dispatch); subsequent PRs that add type
+/// specializations or composition tests should tighten the threshold to
+/// reflect the actual savings their changes deliver.
+#[cfg(all(test, feature = "dhat-heap"))]
+mod dhat_tests {
+ use std::sync::Arc;
+
+ use arrow::array::{ArrayRef, Int64Array, StringArray};
+ use arrow::datatypes::{DataType, Field, Schema};
+
+ use crate::aggregates::group_values::{
+ GroupValues, multi_group_by::GroupValuesColumn,
+ };
+
+ /// Pin the peak heap of a multi-column `GroupValuesColumn` intern at
+ /// roughly the level it sits at today on this branch. The budget is
+ /// intentionally loose (1 MiB on a 1k-row workload) so the test
+ /// triggers only on a real regression, not normal allocator noise.
+ /// Future PRs adding specializations should lower this number.
+ #[test]
+ fn group_values_column_heap_under_budget() {
+ let _profiler = dhat::Profiler::builder().testing().build();
+
+ let schema = Arc::new(Schema::new(vec![
+ Field::new("a", DataType::Int64, false),
+ Field::new("b", DataType::Utf8, false),
+ ]));
+ let n: i64 = 1000;
+ let a: ArrayRef = Arc::new(Int64Array::from_iter_values(0..n));
+ let b: ArrayRef = Arc::new(StringArray::from_iter_values(
+ (0..n).map(|i| format!("row-{i:04}")),
+ ));
+
+ let mut gv =
GroupValuesColumn::<false>::try_new(Arc::clone(&schema)).unwrap();
+ let mut groups = Vec::new();
+ gv.intern(&[a, b], &mut groups).unwrap();
Review Comment:
Fixed. The schema, arrays, GroupValuesColumn, and groups vec are now all
built before `Profiler::builder().testing().build()`. The 1k `format!()` String
allocations and array buffers no longer count against the 1 MiB budget; only
what `intern` itself touches does. Validated locally with the dhat test still
passing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]