adriangb commented on PR #19483: URL: https://github.com/apache/datafusion/pull/19483#issuecomment-3692158283
I've added a property based test that asserts the property that results should be the same after unparsing and re-parsing a query given the same input data*. I think this is a good test because: - It uses real world queries and data - It's a property based test on the thing users care about in general (correct results) instead of e.g. asserting the unparsed SQL matches some shape *: Not all queries have a deterministic sort order. I check if the *original* query has a known output ordering and if it doesn't I sort both outputs. These tests show that without these fixes there are two issues for ClickBench queries: - Column name quoting is missing for columns with uppercase letters - The ORDER BY bug Here is the failure output (also relevant to judge since the tests are being added): ``` 3 Clickbench test(s) failed: Results mismatch for q15. Original SQL: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591 -- set datafusion.execution.parquet.binary_as_string = true SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC LIMIT 10; Unparsed SQL: SELECT hits."UserID", "count(*)" FROM ( SELECT hits."UserID", count(1) AS "count(*)", count(1) FROM hits GROUP BY hits."UserID" ORDER BY count(1) DESC NULLS FIRST ) LIMIT 10 --- Results mismatch for q16. Original SQL: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591 -- set datafusion.execution.parquet.binary_as_string = true SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10; Unparsed SQL: SELECT hits."UserID", hits."SearchPhrase", "count(*)" FROM ( SELECT hits."UserID", hits."SearchPhrase", count(1) AS "count(*)", count(1) FROM hits GROUP BY hits."UserID", hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST ) LIMIT 10 --- Results mismatch for q18. Original SQL: -- Must set for ClickBench hits_partitioned dataset. See https://github.com/apache/datafusion/issues/16591 -- set datafusion.execution.parquet.binary_as_string = true SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10; Unparsed SQL: SELECT hits."UserID", m, hits."SearchPhrase", "count(*)" FROM ( SELECT hits."UserID", date_part('MINUTE', to_timestamp_seconds(hits."EventTime")) AS m, hits."SearchPhrase", count(1) AS "count(*)", count(1) FROM hits GROUP BY hits."UserID", date_part('MINUTE', to_timestamp_seconds(hits."EventTime")), hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST ) LIMIT 10 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
