hhhizzz opened a new issue, #23027:
URL: https://github.com/apache/datafusion/issues/23027

   ### Describe the bug
   
   ## Problem
   
   After `73e3c2a617` / #22646 (`chore: Add primary key constraints for TPC-H, 
TPC-DS`), TPC-DS q39 shows a large performance regression.
   
   The regression appears related to SQL aggregate planning with functional 
dependencies from primary key constraints.
   
   In q39, the query groups by key columns such as:
   
   - `item.i_item_sk`
   - `warehouse.w_warehouse_sk`
   
   After primary key constraints are present, the optimized aggregate plan 
expands the `GROUP BY` keys with many functionally dependent columns from those 
tables, even though the query does not need those columns after aggregation.
   
   Examples observed in the plan include:
   
   - `item.i_item_id`
   - `item.i_product_name`
   - `warehouse.w_gmt_offset`
   
   This makes the aggregate keys much wider and also causes extra columns to be 
projected from scans and carried through joins/aggregation.
   
   ## Regression Shape
   
   The regression pattern is:
   
   1. TPC-DS table schemas include primary key constraints.
   2. SQL planning recognizes functional dependencies from those constraints.
   3. Aggregate planning expands grouped primary key columns into dependent 
columns.
   4. The expansion includes columns that are not referenced by the query 
output.
   5. The plan carries much wider group keys than needed.
   6. q39 runtime increases substantially.
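
The steps above can be illustrated with a small toy model. This is only a sketch of the suspected behaviour, not DataFusion's planner code: the `SCHEMAS` dictionary, the `expand_group_keys` helper, and the reduced column lists are all hypothetical and exist only for illustration.

```python
# Toy model of the regression shape; NOT DataFusion's actual planner logic.
# Hypothetical, heavily reduced schemas: table -> (primary key, columns).
SCHEMAS = {
    "item": ("i_item_sk", ["i_item_sk", "i_item_id", "i_product_name"]),
    "warehouse": (
        "w_warehouse_sk",
        ["w_warehouse_sk", "w_warehouse_name", "w_gmt_offset"],
    ),
}


def expand_group_keys(group_keys):
    """Append every column functionally determined by a grouped primary key."""
    expanded = list(group_keys)
    for key in group_keys:
        table, column = key.split(".")
        primary_key, columns = SCHEMAS[table]
        if column != primary_key:
            continue  # only a grouped primary key determines the other columns
        for dependent in columns:
            qualified = f"{table}.{dependent}"
            if qualified not in expanded:
                expanded.append(qualified)
    return expanded


group_keys = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
expanded_keys = expand_group_keys(group_keys)
print(len(group_keys), "->", len(expanded_keys), "group keys")
```

Even with three columns per table, the group key triples in width. The real TPC-DS tables have many more columns, so the effect on q39 would be correspondingly larger.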
   
   This looks like a planner-level issue rather than a Parquet reader issue: 
disabling the TPC-DS primary key constraints makes q39 return to the previous 
timing range.
   
   ## Benchmark Results
   
   Environment:
   
   ```text
   TPC-DS SF10
   CPU: 24 Cores
   Rounds: 10
   Iterations: 1
   Parquet pushdown filters: true
   Parquet reorder filters: true
   Parquet pruning: true
   ```
   
   With TPC-DS primary key constraints enabled:
   
   ```text
   q39 current mean: ~8301 ms
   ```
   
   With TPC-DS primary key constraints disabled for diagnosis:
   
   ```text
   q39 current total: 14288.69 ms over 10 rounds
   q39 current mean:  ~1428.87 ms
   geomean current/main: 0.983399
   failures: 0
   ```
   
   So the q39 mean changes roughly as follows:
   
   ```text
   ~8301 ms -> ~1429 ms
   ```
   
   when primary key constraints are removed from the TPC-DS schema setup.
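
For reference, the size of the regression can be derived from the figures reported above. The variable names are arbitrary, and the ~8301 ms mean is taken as reported, since the per-round total for that run is not listed.

```python
# Figures copied from the benchmark output in this report.
mean_with_pk_ms = 8301.0        # reported mean with PK constraints enabled
total_without_pk_ms = 14288.69  # reported total with PK constraints disabled
rounds = 10

mean_without_pk_ms = total_without_pk_ms / rounds
slowdown = mean_with_pk_ms / mean_without_pk_ms

print(f"mean without PK constraints: {mean_without_pk_ms:.2f} ms")
print(f"slowdown with PK constraints: {slowdown:.2f}x")
```

Assuming both runs are otherwise comparable, enabling the constraints makes q39 a little under six times slower.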
   
   ## Expected Behavior
   
   Functional dependency support should still let queries select columns 
determined by the grouped keys, but aggregate planning should not add 
unreferenced functionally dependent columns to the logical or physical group 
keys.
   
   Only columns that are actually required after aggregation should appear in 
the aggregate output or grouping.
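
A minimal sketch of the expected behaviour follows, assuming the planner knows which columns are referenced above the aggregate. The `prune_group_keys` helper and the example column sets are hypothetical; this is not an existing DataFusion API.

```python
def prune_group_keys(expanded_keys, original_keys, referenced_after_aggregate):
    """Keep the original group keys plus any dependent column used downstream.

    Dropping the other dependent columns does not change the grouping,
    because each of them is determined by a primary key that stays in the list.
    """
    keep = set(original_keys) | set(referenced_after_aggregate)
    return [key for key in expanded_keys if key in keep]


original_keys = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
expanded_keys = original_keys + [
    "item.i_item_id",
    "item.i_product_name",
    "warehouse.w_gmt_offset",
]

# Hypothetical case 1: nothing beyond the keys is referenced after aggregation.
keys_only = prune_group_keys(expanded_keys, original_keys, set())

# Hypothetical case 2: the query also selects item.i_item_id.
with_item_id = prune_group_keys(expanded_keys, original_keys, {"item.i_item_id"})

print(keys_only)
print(with_item_id)
```

In the first case the group key shrinks back to the two columns the query wrote. In the second, only the one dependent column that is actually selected is retained.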
   
   ### To Reproduce
   
   Run TPC-DS q39 before and after 
https://github.com/apache/datafusion/pull/22646 and compare the runtimes.
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

