adriangb commented on PR #19483:
URL: https://github.com/apache/datafusion/pull/19483#issuecomment-3692158283

   I've added a property based test that asserts the property that results 
should be the same after unparsing and re-parsing a query given the same input 
data*. I think this is a good test because:
   - It uses real world queries and data
   - It's a property based test on the thing users care about in general 
(correct results) instead of e.g. asserting the unparsed SQL matches some shape
   
   *: Not all queries have a deterministic sort order. I check if the 
*original* query has a known output ordering and if it doesn't I sort both 
outputs.
   
   These tests show that without these fixes there are two issues for 
ClickBench queries:
   - Column name quoting is missing for columns with uppercase letters
   - The ORDER BY bug
   
   Here is the failure output (also relevant to judge since the tests are being 
added):
   
   ```
   3 Clickbench test(s) failed:
   
   Results mismatch for q15.
   Original SQL:
   -- Must set for ClickBench hits_partitioned dataset. See 
https://github.com/apache/datafusion/issues/16591
   -- set datafusion.execution.parquet.binary_as_string = true
   
   SELECT "UserID", COUNT(*) FROM hits GROUP BY "UserID" ORDER BY COUNT(*) DESC 
LIMIT 10;
   
   Unparsed SQL:
   SELECT
     hits."UserID",
     "count(*)"
   FROM
     (
       SELECT
         hits."UserID",
         count(1) AS "count(*)",
         count(1)
       FROM
         hits
       GROUP BY
         hits."UserID" ORDER BY count(1) DESC NULLS FIRST
     ) LIMIT 10
   
   ---
   
   Results mismatch for q16.
   Original SQL:
   -- Must set for ClickBench hits_partitioned dataset. See 
https://github.com/apache/datafusion/issues/16591
   -- set datafusion.execution.parquet.binary_as_string = true
   
   SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", 
"SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
   
   Unparsed SQL:
   SELECT
     hits."UserID",
     hits."SearchPhrase",
     "count(*)"
   FROM
     (
       SELECT
         hits."UserID",
         hits."SearchPhrase",
         count(1) AS "count(*)",
         count(1)
       FROM
         hits
       GROUP BY
         hits."UserID", hits."SearchPhrase" ORDER BY count(1) DESC NULLS FIRST
     ) LIMIT 10
   
   ---
   
   Results mismatch for q18.
   Original SQL:
   -- Must set for ClickBench hits_partitioned dataset. See 
https://github.com/apache/datafusion/issues/16591
   -- set datafusion.execution.parquet.binary_as_string = true
   
   SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS 
m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" 
ORDER BY COUNT(*) DESC LIMIT 10;
   
   Unparsed SQL:
   SELECT
     hits."UserID",
     m,
     hits."SearchPhrase",
     "count(*)"
   FROM
     (
       SELECT
         hits."UserID",
         date_part('MINUTE', to_timestamp_seconds(hits."EventTime")) AS m,
         hits."SearchPhrase",
         count(1) AS "count(*)",
         count(1)
       FROM
         hits
       GROUP BY
         hits."UserID", date_part('MINUTE', 
to_timestamp_seconds(hits."EventTime")), hits."SearchPhrase" ORDER BY count(1) 
DESC NULLS FIRST
     ) LIMIT 10
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to