andygrove commented on PR #22474: URL: https://github.com/apache/datafusion/pull/22474#issuecomment-4565474441
Hi @Omega359 — I ran an LLM-assisted comparison of the updated queries in this PR against the official TPC-DS 4.0.0 reference queries (`DSGen-software-code-4.0.0/generated_queries/queryN.sql`). The methodology was: normalize whitespace, casing, comments, interval-vs-integer date arithmetic, and all string/numeric literals (since `dsqgen` produces different literal values per instance), then flag remaining semantic differences. Flagging that this was done with LLM assistance — please verify any finding before acting on it. Out of 99 queries: **80 match the spec semantically. 19 diverge; 10 of those look high-risk** (wrong column in an aggregate, wrong aggregate function, missing/extra projection column, or different ORDER BY key). The remaining 9 are DataFusion-specific workarounds that look acceptable but are noted at the bottom for completeness. ### High-risk findings (wrong column / wrong aggregate vs spec) | Q | PR | Spec | |---|----|------| | 1 | `sum(sr_return_amt)` | `sum(sr_fee)` | | 4 | final SELECT/ORDER BY uses `customer_preferred_cust_flag` | uses `customer_email_address` | | 9 | `avg(ss_net_paid)` (×5 CASE branches) | `avg(ss_net_profit)` | | 11 | same as Q4 | | | 14 | second statement uses `SELECT *` over join with `_2`-suffixed aliases | explicit `ty_*`/`ly_*` aliased projection | | 24 | `sum(ss_net_paid) netpaid` (×2) | `sum(ss_sales_price) netpaid` | | 35 | per group: `min(x), max(x), avg(x)` | `avg(x), max(x), sum(x)` (with `aggone/two/three` aliases) | | 45 | `ca_city` in projection, GROUP BY, ORDER BY | `ca_county` | | 47 | v2 CTE adds `i_category, i_brand`; ORDER BY positional `3` | smaller v2; ORDER BY `nsum` by name | | 66 | web aggregates over `ws_ext_sales_price`/`ws_net_paid`; catalog over `cs_sales_price`/`cs_net_paid_inc_tax` (×48 expressions) | web: `ws_sales_price`/`ws_net_paid_inc_tax`; catalog: `cs_ext_sales_price`/`cs_net_paid_inc_ship_tax`. Looks like the web/catalog column families got swapped. | | 74 | `sum(*_net_paid) year_total` (×2) | `max(*_net_paid) year_total` | ### Medium risk — ORDER BY tie-breaker dropped - **Q6**: `ORDER BY cnt` (spec: `cnt, a.ca_state`) - **Q31**: `ORDER BY ss1.ca_county` (spec: `ss1.d_year`) - **Q56**: `ORDER BY total_sales` (spec: `total_sales, i_item_id`) - **Q75**: `ORDER BY sales_cnt_diff` (spec: `sales_cnt_diff, sales_amt_diff`) - **Q57**: v2 CTE has extra `i_category, i_brand` that leak through `SELECT *` ### DataFusion-specific workarounds (noted for completeness, look acceptable) - **Q39, Q64**: `… AS x_2` aliases added to dodge duplicate output column names - **Q90**: subquery alias `at` → `at1` (reserved-word workaround) - **Q2**: alias `x` added to inner UNION subquery (parser requirement) Q44 and Q49 (the two the PR description calls out as spec-conformance fixes) verified clean. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
