nuno-faria opened a new issue, #18982:
URL: https://github.com/apache/datafusion/issues/18982

   ### Describe the bug
   
   I think some ClickBench queries are not being correctly executed, as 
`EventDate` is being treated as a string.
   
   For example, in query 36:
   ```sql
   SELECT "URL", COUNT(*) AS PageViews
   FROM hits
   WHERE "CounterID" = 62
       AND "EventDate" >= '2013-07-01'
       AND "EventDate" <= '2013-07-31'
       AND "DontCountHits" = 0
       AND "IsRefresh" = 0
       AND "URL" <> ''
   GROUP BY "URL"
   ORDER BY PageViews DESC
   LIMIT 10;
   ```
   
   When executing it, `EventDate` (originally a `UInt16`) will be casted to 
`Utf8`:
   ```
   DataSourceExec: ...
     predicate=CounterID@1 = 62
       AND CAST(EventDate@0 AS Utf8) >= 2013-07-01
       AND CAST(EventDate@0 AS Utf8) <= 2013-07-31
       AND DontCountHits@4 = 0
       AND IsRefresh@3 = 0
       AND URL@2 != , ...
   ```
   
   This in turn makes the query return nothing. If we instead cast the 
`EventDate` first to date then it will work as expected:
   ```sql
   -- current
   SELECT "URL", COUNT(*) AS PageViews
   FROM hits
   WHERE "CounterID" = 62
       AND "EventDate" >= '2013-07-01'
       AND "EventDate" <= '2013-07-31'
       AND "DontCountHits" = 0
       AND "IsRefresh" = 0
       AND "URL" <> ''
   GROUP BY "URL"
   ORDER BY PageViews DESC
   LIMIT 10;
   
   +-----+-----------+
   | URL | pageviews |
   +-----+-----------+
   +-----+-----------+
   0 row(s) fetched. 
   Elapsed 0.120 seconds.
   
   -- with cast
   SELECT "URL", COUNT(*) AS PageViews
   FROM hits
   WHERE "CounterID" = 62
       AND "EventDate"::int::date >= '2013-07-01'
       AND "EventDate"::int::date <= '2013-07-31'
       AND "DontCountHits" = 0
       AND "IsRefresh" = 0
       AND "URL" <> ''
   GROUP BY "URL"
   ORDER BY PageViews DESC
   LIMIT 10;
   
   
+--------------------------------------------------------------------------+-----------+
   | URL                                                                      | 
pageviews |
   
+--------------------------------------------------------------------------+-----------+
   | http://irr.ru/index.php?showalbum/login-leniya7777294,938303130          | 
102341    |
   | http://komme%2F27.0.1453.116                                             | 
51218     |
   | http://irr.ru/index.php?showalbum/login-kapusta-advert2668]=0&order_by=0 | 
18315     |
   | http://irr.ru/index.php?showalbum/login-kapustic/product_name            | 
16461     |
   | http://irr.ru/index.php                                                  | 
12577     |
   | http://irr.ru/index.php?showalbum/login                                  | 
10880     |
   | http://komme%2F27.0.1453.116 Safari%2F5.0 (compatible; MSIE 9.0;         | 
7627      |
   | http://irr.ru/index.php?showalbum/login-kupalnik                         | 
4369      |
   | http://irr.ru/index.php?showalbum/login-kapusta-advert27256.html_params  | 
4058      |
   | http://komme%2F27.0.1453.116 Safari                                      | 
3021      |
   
+--------------------------------------------------------------------------+-----------+
   10 row(s) fetched. 
   Elapsed 0.189 seconds.
   ```
   
   I tracked it down to this issue 
https://github.com/apache/datafusion/issues/15509 and this PR 
https://github.com/apache/datafusion/pull/15574, where it has been removed by 
mistake and the existing tests were not able to caught it. It includes queries 
36 to 42. I think the confusion came from the fact that DuckDB also does not do 
the cast when executing the query, [but it casts when creating the 
view](https://github.com/ClickHouse/ClickBench/blob/d876e9bcf30f09cd5df0e025722db1e95374aa1c/duckdb-parquet/create.sql#L1-L4):
   
   ```sql
   CREATE VIEW hits AS
   SELECT *
       REPLACE (make_date(EventDate) AS EventDate)
   FROM read_parquet('hits.parquet', binary_as_string=True);
   ```
   
   ### To Reproduce
   
   ClickBench.
   
   ### Expected behavior
   
   Run data in queries 36-42.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to