alamb opened a new issue, #20325:
URL: https://github.com/apache/datafusion/issues/20325

   ### Is your feature request related to a problem or challenge?
   
   - Part of https://github.com/apache/datafusion/issues/20324
   
   When you run Q10 with predicate pushdown enabled (see #20324 for details of what that means), it runs more slowly:
   
   ```sql
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   ```
   
   Specifically, in my benchmark run it is about 34% slower (1.34x):
   
   ```
   Benchmark clickbench_partitioned.json
   ┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query     ┃        HEAD ┃ alamb_pushdown_and_arrow_58 ┃        Change ┃
   ┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ QQuery 10 │   333.30 ms │                   447.25 ms │  1.34x slower │
   ```
   
   You can repro it like this:
   
   <details><summary>Repro Script</summary>
   <p>
   
   
   ```sql
   SET datafusion.execution.parquet.binary_as_string = true; -- needed for ClickBench data
   SET datafusion.execution.target_partitions = 1; -- set to 1 to reduce variability
   CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/home/ec2-user/datafusion/benchmarks/data/hits_partitioned';
   
   -- Q10 (default configuration, no pushdown)
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   
   -- Q10 with pushdown enabled
   SET datafusion.execution.parquet.pushdown_filters = true;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE 
"MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
   ```
   
   </p>
   </details> 
   
   
   When I run this with `datafusion-cli -f /tmp/q.sql | grep Elapsed`, the slowdown is clearly visible:
   ```shell
   Elapsed 0.001 seconds.
   Elapsed 0.000 seconds.
   Elapsed 0.063 seconds.
   Elapsed 1.419 seconds. <-- First query run 
   Elapsed 1.432 seconds.
   Elapsed 1.387 seconds.
   Elapsed 1.399 seconds.
   Elapsed 1.388 seconds.
   Elapsed 0.000 seconds. <---- turn on filter pushdown
   Elapsed 1.690 seconds. <-- now the query runs about 20% slower
   Elapsed 1.671 seconds.
   Elapsed 1.695 seconds.
   Elapsed 1.695 seconds.
   Elapsed 1.718 seconds.
   ```
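
To turn the raw `Elapsed` lines into a single number, the two groups of runs can be averaged. The following is a minimal sketch, not part of the original report: `summarize_elapsed` is a hypothetical helper name, and the 0.5 second cutoff used to tell `SET`/`CREATE` statements apart from query runs is an assumption that fits the timings above.

```shell
# Hypothetical helper: reads "Elapsed N seconds." lines on stdin and prints
# the mean of the query runs before and after the pushdown SET statement.
# Assumption: any statement faster than the cutoff (default 0.5 s) is a
# SET/CREATE statement rather than a query run.
summarize_elapsed() {
  LC_ALL=C awk -v min="${1:-0.5}" '
    BEGIN { g = 0 }
    $1 == "Elapsed" {
      t = $2 + 0
      if (t < min + 0) {
        # A fast statement after at least one query run starts the next group.
        if (n[g] > 0) g++
        next
      }
      sum[g] += t
      n[g]++
    }
    END {
      if (n[0] == 0 || n[1] == 0) {
        print "need two groups of query timings" > "/dev/stderr"
        exit 1
      }
      base = sum[0] / n[0]
      push = sum[1] / n[1]
      printf "baseline %.3f s, pushdown %.3f s, ratio %.2fx\n", base, push, push / base
    }
  '
}
```

It can be appended to the pipeline above, for example `datafusion-cli -f /tmp/q.sql | grep Elapsed | summarize_elapsed`. For the timings listed above, the means come out at roughly 1.405 s without pushdown and 1.694 s with it, a ratio of about 1.21x.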
   
   ### Describe the solution you'd like
   
   I would like Q10 to run with no slowdown when filter pushdown is enabled.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

