kosiew opened a new pull request, #19476:
URL: https://github.com/apache/datafusion/pull/19476

   ## Which issue does this PR close?
   
   * Closes #18989.
   
   ## Rationale for this change
   
   A panic could occur during `SanityCheckPlan` when optimizing / executing 
queries that chain multiple aggregations over a multi-partitioned input (often 
empty tables), e.g. `Sort -> Aggregate(ts, region) -> Sort -> Aggregate(ts)`.
   
   In these cases the first aggregation can produce a hash partitioning that is 
a **superset** of the second aggregation’s required distribution keys. The 
optimizer previously treated some of these situations as not needing a hash 
repartition (especially when the input was empty / stats suggested repartition 
wasn’t beneficial), which could leave the plan violating the downstream 
operator’s distribution requirement and trigger `SanityCheckPlan` failures.
   
   This PR aligns the optimizer’s distribution enforcement decisions with the 
same partitioning satisfaction logic used by the sanity checker, ensuring that 
required repartitions are inserted whenever the downstream requirement is not 
satisfied.
   
   ## What changes are included in this PR?
   
   * Introduced `DistributionSatisfactionResult` and a shared 
`distribution_satisfaction(...)` helper in `enforce_distribution` to compute 
and carry:
   
     * the required `Distribution`
     * the child’s `output_partitioning`
     * the computed `PartitioningSatisfaction` (Exact / Subset / NotSatisfied)
     * a single, reusable `requires_repartition(...)` decision function.
   * Updated `add_hash_on_top` and `ensure_distribution` to use the shared 
satisfaction result and decision logic, rather than ad-hoc checks.
   * Simplified `get_repartition_requirement_status` by removing alignment 
logic based on `hash_necessary` flags and instead deferring to the 
satisfaction-based repartition decision.
   * Updated `SanityCheckPlan` to reuse the same 
`distribution_satisfaction(...)` helper (so optimizer and sanity checker 
evaluate distribution satisfaction consistently).
   * Re-exported `PartitioningSatisfaction` from `datafusion_physical_expr` for 
use in tests and downstream code.
   * Added targeted regression tests:
   
     * Unit tests verifying `PartitioningSatisfaction` outcomes (Exact / Subset 
/ NotSatisfied) match sanity checker behavior for hash partitioning.
     * A tokio regression test reproducing the chained-aggregate scenario from 
#18989 and asserting the optimized plan either inserts a repartition between 
aggregates or preserves a single-partition stream via 
`SortPreservingMergeExec`, and executes without panic.
   * Updated sqllogictest expected plans where distribution enforcement now 
inserts `RepartitionExec` and/or preserves partitioning through `SortExec`.
   
   ## Are these changes tested?
   
   Yes.
   
   * Added new unit tests in 
`datafusion/core/tests/physical_optimizer/enforce_distribution.rs` to validate 
distribution satisfaction behavior and its alignment with `SanityCheckPlan`.
   * Added an async regression test covering the multi-partition empty table + 
chained aggregations scenario from #18989.
   * Updated sqllogictest outputs to reflect the new physical plans.
   
   Performance sanity checks:
   
   * Ran the following benchmarks and verified no performance regressions:
   
     * `cargo bench  --bench topk_aggregate --profile=profiling -- 
--save-baseline before`
     * `cargo bench  --bench distinct_query_sql --profile=profiling -- 
--save-baseline before`
     * `cargo bench  --bench sort_preserving_merge --profile=profiling -- 
--save-baseline before`
   
   ## Are there any user-facing changes?
   
   Potentially.
   
   * Some queries may now include additional `RepartitionExec` operators (or 
preserve partitioning through sorts) to correctly satisfy downstream 
distribution requirements. This can change the displayed physical plan (EXPLAIN 
output) and may affect performance characteristics in edge cases, but fixes a 
correctness/stability issue (prevents optimizer-time / execution-time panics).
   * No SQL syntax or API behavior changes are intended.
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to