andygrove commented on PR #22474:
URL: https://github.com/apache/datafusion/pull/22474#issuecomment-4565474441

   Hi @Omega359 — I ran an LLM-assisted comparison of the updated queries in 
this PR against the official TPC-DS 4.0.0 reference queries 
(`DSGen-software-code-4.0.0/generated_queries/queryN.sql`). The methodology 
was: normalize whitespace, casing, comments, interval-vs-integer date 
arithmetic, and all string/numeric literals (since `dsqgen` produces different 
literal values per instance), then flag remaining semantic differences. 
Flagging that this was done with LLM assistance — please verify any finding 
before acting on it.
   
   Out of 99 queries: **80 match the spec semantically. 19 diverge; 10 of those 
look high-risk** (wrong column in an aggregate, wrong aggregate function, 
missing/extra projection column, or different ORDER BY key). The remaining 9 
are DataFusion-specific workarounds that look acceptable but are noted at the 
bottom for completeness.
   
   ### High-risk findings (wrong column / wrong aggregate vs spec)
   
   | Q | PR | Spec |
   |---|----|------|
   | 1 | `sum(sr_return_amt)` | `sum(sr_fee)` |
   | 4 | final SELECT/ORDER BY uses `customer_preferred_cust_flag` | uses 
`customer_email_address` |
   | 9 | `avg(ss_net_paid)` (×5 CASE branches) | `avg(ss_net_profit)` |
   | 11 | same as Q4 | |
   | 14 | second statement uses `SELECT *` over join with `_2`-suffixed aliases 
| explicit `ty_*`/`ly_*` aliased projection |
   | 24 | `sum(ss_net_paid) netpaid` (×2) | `sum(ss_sales_price) netpaid` |
   | 35 | per group: `min(x), max(x), avg(x)` | `avg(x), max(x), sum(x)` (with 
`aggone/two/three` aliases) |
   | 45 | `ca_city` in projection, GROUP BY, ORDER BY | `ca_county` |
   | 47 | v2 CTE adds `i_category, i_brand`; ORDER BY positional `3` | smaller 
v2; ORDER BY `nsum` by name |
   | 66 | web aggregates over `ws_ext_sales_price`/`ws_net_paid`; catalog over 
`cs_sales_price`/`cs_net_paid_inc_tax` (×48 expressions) | web: 
`ws_sales_price`/`ws_net_paid_inc_tax`; catalog: 
`cs_ext_sales_price`/`cs_net_paid_inc_ship_tax`. Looks like the web/catalog 
column families got swapped. |
   | 74 | `sum(*_net_paid) year_total` (×2) | `max(*_net_paid) year_total` |
   
   ### Medium risk — ORDER BY tie-breaker dropped
   
   - **Q6**: `ORDER BY cnt` (spec: `cnt, a.ca_state`)
   - **Q31**: `ORDER BY ss1.ca_county` (spec: `ss1.d_year`)
   - **Q56**: `ORDER BY total_sales` (spec: `total_sales, i_item_id`)
   - **Q75**: `ORDER BY sales_cnt_diff` (spec: `sales_cnt_diff, sales_amt_diff`)
   - **Q57**: v2 CTE has extra `i_category, i_brand` that leak through `SELECT 
*`
   
   ### DataFusion-specific workarounds (noted for completeness, look acceptable)
   
   - **Q39, Q64**: `… AS x_2` aliases added to dodge duplicate output column 
names
   - **Q90**: subquery alias `at` → `at1` (reserved-word workaround)
   - **Q2**: alias `x` added to inner UNION subquery (parser requirement)
   
   Q44 and Q49 (the two the PR description calls out as spec-conformance fixes) 
verified clean.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to