ydgandhi commented on PR #21088: URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4115356321
Hi! @xiedeyantu, great feedback. Let me explain my thought process for some of the design choices regarding the tests. See inline > Thank you for the detailed test! But I have a few questions. > > 1. Why add an extra > "SUM" outside? The outer `SELECT SUM(...) FROM ( ... GROUP BY ... )` is a **harness choice**: - The work under measurement is the **inner** grouped aggregate (multiple `COUNT(DISTINCT …)`). - The outer `SUM` collapses per-group results to **one row**, which avoids dumping thousands of rows when comparing wall time / peak RSS. - To benchmark **only** the inner shape, run the subquery alone (or use `SELECT COUNT(*) FROM (inner) t` if you want a scalar sink without changing the inner aggregates). > 2. I think it’s better for > "CAST" not to involve expression calculations. Agreed for a clean test: both distincts should use `CAST` on a column only. Our Scenario 2 used `CAST(l_extendedprice AS INT)` (column-only) and `CAST(l_discount * 100 AS INT)` (expression inside the cast). We'll switch the second to something like `CAST(l_discount AS INT)` (or another column-only cast) so we're not mixing in arithmetic inside `CAST`, and so it lines up with the rule's conservative “safe” distinct arguments. > 3. It seems the third case is a low-cardinality test — could you show a high-cardinality scenario? The `lower()` case used `GROUP BY l_returnflag`, which is only a few groups. For a high-cardinality variant we'll use the same select list but a denser key, e.g. `GROUP BY l_suppkey` (or `l_orderkey`), so the join after the rewrite has many more keys — still with `COUNT(DISTINCT lower(l_shipmode))` and `COUNT(DISTINCT l_shipinstruct)`. We'll follow up with updated query text and numbers when we rerun. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
