ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4115356321

   Hi! @xiedeyantu, great feedback. Let me explain my thought process for some 
of the design choices regarding the tests. See inline 
   
   > Thank you for the detailed test! But I have a few questions.
   > 
   > 1. Why add an extra
   >    "SUM" outside?
   
   The outer `SELECT SUM(...) FROM ( ... GROUP BY ... )` is a **harness 
choice**:
   
   - The work under measurement is the **inner** grouped aggregate (multiple 
`COUNT(DISTINCT …)`).
   - The outer `SUM` collapses per-group results to **one row**, which avoids 
dumping thousands of rows when comparing wall time / peak RSS.
   - To benchmark **only** the inner shape, run the subquery alone (or use 
`SELECT COUNT(*) FROM (inner) t` if you want a scalar sink without changing the 
inner aggregates).
   
   > 2. I think it’s better for
   >    "CAST" not to involve expression calculations.
   
   Agreed for a clean test: both distincts should use `CAST` on a column only. 
Our Scenario 2 used `CAST(l_extendedprice AS INT)` (column-only) and 
`CAST(l_discount * 100 AS INT)` (expression inside the cast). We'll switch the 
second to something like `CAST(l_discount AS INT)` (or another column-only 
cast) so we're not mixing in arithmetic inside `CAST`, and so it lines up with 
the rule's conservative “safe” distinct arguments.
   
   > 3. It seems the third case is a low-cardinality test — could you show a 
high-cardinality scenario?
   
   The `lower()` case used `GROUP BY l_returnflag`, which is only a few groups. 
For a high-cardinality variant we'll use the same select list but a denser key, 
e.g. `GROUP BY l_suppkey` (or `l_orderkey`), so the join after the rewrite has 
many more keys — still with `COUNT(DISTINCT lower(l_shipmode))` and 
`COUNT(DISTINCT l_shipinstruct)`.
   
   We'll follow up with updated query text and numbers when we rerun.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to