zhuqi-lucas opened a new pull request, #22751: URL: https://github.com/apache/datafusion/pull/22751
# Which issue does this PR close? PR 1 of 5 from the split agreed on https://github.com/apache/datafusion/pull/22706#issuecomment-4602892722. Related to #22682 / #22715 (full `GroupValuesColumn` type coverage). Closes nothing on its own; lays the dispatcher foundation. # Rationale for this change Preparation PR for the nested-type `GroupColumn` work. No new builders here. The goal is to refactor the dispatch in `GroupValuesColumn::intern` so subsequent PRs that add `FixedSizeList` / `Struct` / `List` / `LargeList` support can plug into one well-defined factory instead of growing the inline match in `intern`. Also includes two adjacent correctness fixes around the `Time32` / `Time64` variants that came up in the upstream thread. # What changes are included in this PR? 1. **Factory extraction.** The inline match that maps each schema field to a `Box<dyn GroupColumn>` builder moves out of `GroupValuesColumn::intern` into a free function `make_group_column(field: &Field) -> Result<Box<dyn GroupColumn>>`. `intern` becomes a one-liner loop. Future nested-type specializations recursively call this factory for child field construction without enumerating every combination inline. 2. **Time32 / Time64 supported_type alignment.** Previously `supported_type` matched `Time32(_)` (admitting the invalid Microsecond / Nanosecond combinations) and did not match `Time64(_)` at all, while the dispatcher accepted `Time32(Second / Millisecond)` and `Time64(Microsecond / Nanosecond)`. Tighten `supported_type` to the exact set the dispatcher constructs. The dispatcher's wildcard arms for invalid `Time` variants now return `not_impl_err` instead of silently doing nothing, so a schema that somehow reaches the dispatcher fails loudly rather than producing an empty builder vector. 3. **`supported_type` ↔ `make_group_column` consistency fuzz.** New unit test `supported_type_and_make_group_column_stay_in_sync` iterates a representative set of 20 supported and 6 unsupported `DataType`s and asserts the biconditional. Pins the alignment so future contributors who add a type to one side without the other trip a unit test immediately. 4. **dhat heap regression harness behind a new `dhat-heap` feature.** Adds an optional `dhat = "0.3"` dependency, a `dhat-heap` feature in `physical-plan/Cargo.toml`, a feature-gated global allocator in `lib.rs`, and a `dhat_tests` submodule that wraps a representative `GroupValuesColumn::intern` workload in `dhat::Profiler` and asserts peak heap stays under a 1 MiB budget on a 1k-row 2-col workload. Default `cargo test` does not compile or run dhat. Subsequent type-support PRs should tighten the budget to reflect their savings. # Are these changes tested? - `cargo test -p datafusion-physical-plan --lib aggregates::group_values` passes 30 tests (27 existing + 3 new: the consistency fuzz, the mixed-schema rejection, and the NotImpl propagation). - `cargo test -p datafusion-physical-plan --lib aggregates` passes 105 tests (no regression in the broader aggregate suite). - `cargo test -p datafusion-physical-plan --features dhat-heap aggregates::group_values::multi_group_by::dhat_tests` passes 1 test with peak heap well under the 1 MiB budget. - `cargo clippy -p datafusion-physical-plan --lib --tests --features dhat-heap -- -D warnings` is clean. # Are there any user-facing changes? No semantic change for any schema that already used `GroupValuesColumn` on main. The factory is the same dispatch logic pulled into a function; the `Time` changes only affect schemas that are semantically invalid Arrow types anyway. The `dhat-heap` feature is new but off by default, so consumers that do not enable it pay no overhead. # What this PR is NOT doing - No new `GroupColumn` builders. `FixedSizeList`, `Struct`, `List`, `LargeList` come in subsequent PRs of the #22682 sequence. - No new types in `supported_type`'s allow-list (the Time tightening removes invalid combinations rather than adding new ones). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
