kosiew opened a new pull request, #21472:
URL: https://github.com/apache/datafusion/pull/21472

   ## Which issue does this PR close?
   
   * Part of #20788
   
   ---
   
   ## Rationale for this change
   
   This PR introduces targeted regression coverage for a query pattern 
involving `unnest`, `row_number` window functions, and `array_agg` with `ORDER 
BY`. This pattern is known to trigger a different execution path in DataFusion 
compared to unordered `array_agg`, potentially leading to higher memory usage.
   
   Existing tests only covered the unordered variant of `array_agg`, which uses 
the optimized `GroupsAccumulator` fast path. However, when `ORDER BY` is 
introduced inside `array_agg`, DataFusion switches to an order-sensitive 
accumulation strategy that does not use this fast path.
   
   This change ensures that:
   
   * The execution plan shape for ordered `array_agg` is explicitly captured 
and validated.
   * The behavioral difference between ordered and unordered aggregation is 
documented and tested.
   * A stable baseline is established for future optimizations and regression 
detection.
   
   ---
   
   ## What changes are included in this PR?
   
   * Added a new Rust regression test 
(`ordered_array_agg_after_unnest_regression`) that:
   
     * Validates correctness of ordered `array_agg` after `unnest` and window 
functions.
     * Asserts expected physical plan structure, including:
   
       * `UnnestExec`
       * `BoundedWindowAggExec`
       * `SortExec`
       * `AggregateExec` in `SinglePartitioned` mode
     * Verifies that ordered aggregation avoids partial/final aggregation modes.
   
   * Added a comparison with the unordered variant to highlight execution 
differences:
   
     * Confirms use of `Partial` and `FinalPartitioned` aggregation for 
unordered case.
     * Demonstrates optimizer pruning of unused window expressions.
   
   * Added an `EXPLAIN ANALYZE` test 
(`ordered_array_agg_after_unnest_explain_analyze_metrics`) that:
   
     * Captures execution metrics such as output rows and memory usage.
     * Establishes a baseline for performance and memory tracking.
   
   * Extended `sqllogictest` coverage (`unnest.slt`) with:
   
     * A new `EXPLAIN` test showcasing the logical and physical plan for 
ordered `array_agg`.
     * A correctness test validating deterministic ordering within aggregated 
arrays.
     * Inline documentation explaining why unordered `array_agg` is 
insufficient coverage.
   
   * Introduced helper utilities in tests:
   
     * `ArrayAggOrder` enum to toggle ordered vs unordered queries.
     * SQL builder for consistent query generation.
     * Plan assertion helpers to validate execution structure.
   
   ---
   
   ## Are these changes tested?
   
   Yes.
   
   The PR includes multiple layers of test coverage:
   
   1. **Correctness Tests**
   
      * Validate that ordered `array_agg` preserves element ordering after 
`unnest`.
      * Use deterministic synthetic datasets.
   
   2. **Plan Shape Assertions**
   
      * Ensure the physical plan includes required operators for ordered 
aggregation.
      * Ensure forbidden operators (e.g., partial aggregation) are not present.
   
   3. **Explain Plan Tests (SQLLogicTest)**
   
      * Capture both logical and physical plans for regression tracking.
   
   4. **Execution Metrics Validation**
   
      * Use `EXPLAIN ANALYZE` to assert row counts and memory-related metrics.
   
   These tests collectively ensure both functional correctness and execution 
stability.
   
   ---
   
   ## Are there any user-facing changes?
   
   No.
   
   This PR only adds test coverage and documentation. There are no changes to 
public APIs or user-facing behavior.
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to