Copilot commented on code in PR #22751:
URL: https://github.com/apache/datafusion/pull/22751#discussion_r3353376372
##########
datafusion/physical-plan/src/lib.rs:
##########
@@ -29,6 +29,15 @@
//!
//! Entrypoint of this crate is trait [ExecutionPlan].
+// Wire dhat in as the global allocator only when the optional `dhat-heap`
+// feature is enabled. This lets the `aggregates::group_values::multi_group_by`
+// memory regression tests measure real heap usage with `dhat::Profiler`.
+// The default test run does not pay any profiler overhead because the
+// dependency is gated behind the feature.
+#[cfg(feature = "dhat-heap")]
+#[global_allocator]
+static ALLOC: dhat::Alloc = dhat::Alloc;
Review Comment:
`dhat-heap` currently installs `dhat::Alloc` as the global allocator for
*all* builds of this crate when the feature is enabled. This can cause
compilation failures (multiple `#[global_allocator]` definitions) when
`dhat-heap` is enabled in the dependency graph of binaries/tests that already
set their own allocator (e.g. `datafusion-cli`). Since the allocator is only
needed for the feature-gated unit tests, gate it behind `cfg(test)` as well.
##########
datafusion/physical-plan/src/aggregates/group_values/multi_group_by/mod.rs:
##########
@@ -1820,3 +1912,65 @@ mod tests {
);
}
}
+
+/// Heap regression tests for `GroupValuesColumn`. Gated behind the
+/// `dhat-heap` feature so the default `cargo test` does not pay the
+/// profiler overhead.
+///
+/// Run with:
+/// ```bash
+/// cargo test -p datafusion-physical-plan --features dhat-heap \
+/// aggregates::group_values::multi_group_by::dhat_tests
+/// ```
+///
+/// Each test wraps a representative `GroupValuesColumn::intern` workload
+/// in a `dhat::Profiler` and asserts that `dhat::HeapStats::get().max_bytes`
+/// stays under a fixed budget. The budget is generous in this PR (which
+/// only refactors the dispatch); subsequent PRs that add type
+/// specializations or composition tests should tighten the threshold to
+/// reflect the actual savings their changes deliver.
+#[cfg(all(test, feature = "dhat-heap"))]
+mod dhat_tests {
+ use std::sync::Arc;
+
+ use arrow::array::{ArrayRef, Int64Array, StringArray};
+ use arrow::datatypes::{DataType, Field, Schema};
+
+ use crate::aggregates::group_values::{
+ GroupValues, multi_group_by::GroupValuesColumn,
+ };
+
+ /// Pin the peak heap of a multi-column `GroupValuesColumn` intern at
+ /// roughly the level it sits at today on this branch. The budget is
+ /// intentionally loose (1 MiB on a 1k-row workload) so the test
+ /// triggers only on a real regression, not normal allocator noise.
+ /// Future PRs adding specializations should lower this number.
+ #[test]
+ fn group_values_column_heap_under_budget() {
+ let _profiler = dhat::Profiler::builder().testing().build();
+
+ let schema = Arc::new(Schema::new(vec![
+ Field::new("a", DataType::Int64, false),
+ Field::new("b", DataType::Utf8, false),
+ ]));
+ let n: i64 = 1000;
+ let a: ArrayRef = Arc::new(Int64Array::from_iter_values(0..n));
+ let b: ArrayRef = Arc::new(StringArray::from_iter_values(
+ (0..n).map(|i| format!("row-{i:04}")),
+ ));
+
+ let mut gv =
GroupValuesColumn::<false>::try_new(Arc::clone(&schema)).unwrap();
+ let mut groups = Vec::new();
+ gv.intern(&[a, b], &mut groups).unwrap();
Review Comment:
The heap regression test starts the `dhat::Profiler` before constructing the
input arrays. This makes the budget sensitive to unrelated allocations from
input setup (e.g. `format!("row-{i:04}")`), which can cause noise/flakiness and
makes the test less focused on `GroupValuesColumn` itself. Start the profiler
after the inputs are constructed (but before `try_new` / `intern`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]