alamb opened a new issue, #20325: URL: https://github.com/apache/datafusion/issues/20325
### Is your feature request related to a problem or challenge?

- Part of https://github.com/apache/datafusion/issues/20324

When you run Q10 with predicate pushdown enabled (see #20324 for details of what that means), it runs more slowly:

```sql
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
```

Specifically, in my benchmark run it is about 30% slower:

```
Benchmark clickbench_partitioned.json
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ alamb_pushdown_and_arrow_58 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 10 │   333.30 ms │                   447.25 ms │  1.34x slower │
```

You can reproduce it like this:

<details><summary>Repro Script</summary>
<p>

```sql
set datafusion.execution.parquet.binary_as_string = true; -- needed for ClickBench data
SET datafusion.execution.target_partitions = 1; -- set to 1 to reduce variability
create external table hits stored as parquet location '/home/ec2-user/datafusion/benchmarks/data/hits_partitioned';

-- Q10 (default configuration, no pushdown)
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

-- Q10 with pushdown enabled
SET datafusion.execution.parquet.pushdown_filters = true;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
```

</p>
</details>

When I run this with `datafusion-cli -f /tmp/q.sql | grep Elapsed`, you can clearly see the slowdown:

```shell
Elapsed 0.001 seconds.
Elapsed 0.000 seconds.
Elapsed 0.063 seconds.
Elapsed 1.419 seconds.  <-- First query run
Elapsed 1.432 seconds.
Elapsed 1.387 seconds.
Elapsed 1.399 seconds.
Elapsed 1.388 seconds.
Elapsed 0.000 seconds.  <---- turn on filter pushdown
Elapsed 1.690 seconds.  <-- now the query runs about 20% slower
Elapsed 1.671 seconds.
Elapsed 1.695 seconds.
Elapsed 1.695 seconds.
Elapsed 1.718 seconds.
```

### Describe the solution you'd like

I want there to be no slowdown when I run Q10.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
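
To turn the `Elapsed` lines from the repro output into a single slowdown figure, a small script along these lines may help. It is only a sketch: the `summarize` helper and its `setup_statements` and `runs` parameters are hypothetical names, and it assumes the layout produced by the repro script (three setup statements, five baseline runs, one `SET`, five pushdown runs).

```python
import re
from statistics import mean

# Matches lines such as "Elapsed 1.419 seconds." as printed by datafusion-cli.
ELAPSED = re.compile(r"Elapsed (\d+\.\d+) seconds")

# Timings copied from the output shown in this issue.
SAMPLE = """\
Elapsed 0.001 seconds.
Elapsed 0.000 seconds.
Elapsed 0.063 seconds.
Elapsed 1.419 seconds.
Elapsed 1.432 seconds.
Elapsed 1.387 seconds.
Elapsed 1.399 seconds.
Elapsed 1.388 seconds.
Elapsed 0.000 seconds.
Elapsed 1.690 seconds.
Elapsed 1.671 seconds.
Elapsed 1.695 seconds.
Elapsed 1.695 seconds.
Elapsed 1.718 seconds.
"""


def summarize(output: str, setup_statements: int = 3, runs: int = 5):
    """Return (baseline mean, pushdown mean, pushdown / baseline ratio).

    Assumes the statement order of the repro script: `setup_statements`
    setup statements, `runs` baseline queries, one SET that enables
    pushdown, then `runs` queries with pushdown enabled.
    """
    times = [float(t) for t in ELAPSED.findall(output)]
    baseline = times[setup_statements:setup_statements + runs]
    start = setup_statements + runs + 1  # skip the SET pushdown_filters line
    pushdown = times[start:start + runs]
    if len(baseline) != runs or len(pushdown) != runs:
        raise ValueError("unexpected number of Elapsed lines")
    base_mean = mean(baseline)
    push_mean = mean(pushdown)
    return base_mean, push_mean, push_mean / base_mean


if __name__ == "__main__":
    base, push, ratio = summarize(SAMPLE)
    print(f"baseline mean: {base:.3f}s")
    print(f"pushdown mean: {push:.3f}s")
    print(f"slowdown:      {ratio:.2f}x")
```

To use it on a fresh run, replace `SAMPLE` with the captured output of `datafusion-cli -f /tmp/q.sql | grep Elapsed`.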
