alamb opened a new issue, #20324: URL: https://github.com/apache/datafusion/issues/20324
[EPIC] No performance regressions when enabling parquet filter pushdown (late materialization) # TLDR This epic tracks the known work needed to enable Parquet "filter pushdown" (late materialization) by default: - https://github.com/apache/datafusion/issues/3463 When Parquet filter pushdown is turned on, some queries get (much) faster, but some queries get slower. We currently assume that we need to ensure we don't have any performance regressions when we enable it by default. If you would like to help, the best thing you can do is to investigate the queries that are slower when we turn on pushdown and figure out how to make them faster. The likely involves profiling ([docs here](https://datafusion.apache.org/library-user-guide/profiling.html) ) and then making code changes to fix the bottlenecks. I suspect many of the changes will be in the Parquet reader (in [arrow-rs](https://github.com/apache/arrow-rs)) ## Leaderboard as of Feb 12. 2025 Highlighted improvements (the large gains are due to being able to evaluate dynamic filters during the scan phase) * Benchmark clickbench_extended.json ``` │ QQuery 6 │ 3977.38 ms │ 155.90 ms │ +25.51x faster │ ``` * Benchmark clickbench_partitioned.json ``` │ QQuery 23 │ 12166.40 ms │ 1289.23 ms │ +9.44x faster │ ``` Here is the current results with queries that are slower when we turn on pushdown, Benchmark clickbench_partitioned.json ┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Query ┃ HEAD ┃ alamb_pushdown_and_arrow_58 ┃ Change ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ QQuery 2 │ 134.10 ms │ 152.95 ms │ 1.14x slower │ │ QQuery 6 │ 6.54 ms │ 21.47 ms │ 3.28x slower │ │ QQuery 7 │ 55.50 ms │ 65.21 ms │ 1.17x slower │ │ QQuery 10 │ 333.30 ms │ 447.25 ms │ 1.34x slower │ │ QQuery 11 │ 392.77 ms │ 494.71 ms │ 1.26x slower │ │ QQuery 12 │ 1193.25 ms │ 1417.78 ms │ 1.19x slower │ │ QQuery 13 │ 1901.29 ms │ 2056.60 ms │ 1.08x slower │ │ QQuery 14 │ 1223.87 ms │ 1420.63 ms │ 1.16x slower │ │ QQuery 19 │ 121.77 ms │ 136.65 ms │ 1.12x slower │ │ QQuery 24 │ 216.43 ms │ 295.62 ms │ 1.37x slower │ │ QQuery 25 │ 453.43 ms │ 605.15 ms │ 1.33x slower │ │ QQuery 26 │ 211.98 ms │ 328.95 ms │ 1.55x slower │ │ QQuery 27 │ 2644.08 ms │ 2909.34 ms │ 1.10x slower │ │ QQuery 28 │ 23267.40 ms │ 24763.39 ms │ 1.06x slower │ │ QQuery 37 │ 79.13 ms │ 87.01 ms │ 1.10x slower │ │ QQuery 40 │ 40.20 ms │ 56.08 ms │ 1.40x slower │ │ QQuery 41 │ 34.76 ms │ 44.99 ms │ 1.29x slower │ │ QQuery 42 │ 31.70 ms │ 34.01 ms │ 1.07x slower │ └───────────┴─────────────┴─────────────────────────────┴───────────────┘ Full details for recent performance results I have when turning this on, see https://github.com/apache/datafusion/pull/20318: <details><summary>Entire Results Benchmark </summary> <p> https://github.com/apache/datafusion/pull/20318#issuecomment-3891569822 ``` Comparing HEAD and alamb_pushdown_and_arrow_58 -------------------- Benchmark clickbench_extended.json -------------------- ┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓ ┃ Query ┃ HEAD ┃ alamb_pushdown_and_arrow_58 ┃ Change ┃ ┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩ │ QQuery 0 │ 2424.01 ms │ 2264.27 ms │ +1.07x faster │ │ QQuery 1 │ 896.58 ms │ 890.50 ms │ no change │ │ QQuery 2 │ 1727.03 ms │ 1713.88 ms │ no change │ │ QQuery 3 │ 1016.56 ms │ 1040.16 ms │ no change │ │ QQuery 4 │ 2196.14 ms │ 2198.21 ms │ no change │ │ QQuery 5 │ 28540.31 ms │ 26699.31 ms │ +1.07x faster │ │ QQuery 6 │ 3977.38 ms │ 155.90 ms │ +25.51x faster │ │ QQuery 7 │ 2690.66 ms │ 2795.88 ms │ no change │ └──────────┴─────────────┴─────────────────────────────┴────────────────┘ ``` https://github.com/apache/datafusion/pull/20318#issuecomment-3891438188 ``` -------------------- Benchmark clickbench_partitioned.json -------------------- ┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Query ┃ HEAD ┃ alamb_pushdown_and_arrow_58 ┃ Change ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ QQuery 0 │ 2.54 ms │ 2.59 ms │ no change │ │ QQuery 1 │ 49.67 ms │ 49.78 ms │ no change │ │ QQuery 2 │ 134.10 ms │ 152.95 ms │ 1.14x slower │ │ QQuery 3 │ 155.24 ms │ 156.89 ms │ no change │ │ QQuery 4 │ 1002.02 ms │ 1022.11 ms │ no change │ │ QQuery 5 │ 1279.70 ms │ 1276.97 ms │ no change │ │ QQuery 6 │ 6.54 ms │ 21.47 ms │ 3.28x slower │ │ QQuery 7 │ 55.50 ms │ 65.21 ms │ 1.17x slower │ │ QQuery 8 │ 1361.66 ms │ 1418.47 ms │ no change │ │ QQuery 9 │ 1784.07 ms │ 1815.62 ms │ no change │ │ QQuery 10 │ 333.30 ms │ 447.25 ms │ 1.34x slower n│ │ QQuery 11 │ 392.77 ms │ 494.71 ms │ 1.26x slower │ │ QQuery 12 │ 1193.25 ms │ 1417.78 ms │ 1.19x slower │ │ QQuery 13 │ 1901.29 ms │ 2056.60 ms │ 1.08x slower │ │ QQuery 14 │ 1223.87 ms │ 1420.63 ms │ 1.16x slower │ │ QQuery 15 │ 1146.25 ms │ 1193.73 ms │ no change │ │ QQuery 16 │ 2462.82 ms │ 2471.52 ms │ no change │ │ QQuery 17 │ 2466.41 ms │ 2369.56 ms │ no change │ │ QQuery 18 │ 4990.28 ms │ 4544.09 ms │ +1.10x faster │ │ QQuery 19 │ 121.77 ms │ 136.65 ms │ 1.12x slower │ │ QQuery 20 │ 1924.82 ms │ 1840.60 ms │ no change │ │ QQuery 21 │ 2206.94 ms │ 2276.71 ms │ no change │ │ QQuery 22 │ 3699.15 ms │ 3776.51 ms │ no change │ │ QQuery 23 │ 12166.40 ms │ 1289.23 ms │ +9.44x faster │ │ QQuery 24 │ 216.43 ms │ 295.62 ms │ 1.37x slower │ │ QQuery 25 │ 453.43 ms │ 605.15 ms │ 1.33x slower │ │ QQuery 26 │ 211.98 ms │ 328.95 ms │ 1.55x slower │ │ QQuery 27 │ 2644.08 ms │ 2909.34 ms │ 1.10x slower │ │ QQuery 28 │ 23267.40 ms │ 24763.39 ms │ 1.06x slower │ │ QQuery 29 │ 962.53 ms │ 964.69 ms │ no change │ │ QQuery 30 │ 1262.47 ms │ 1253.99 ms │ no change │ │ QQuery 31 │ 1315.46 ms │ 1273.97 ms │ no change │ │ QQuery 32 │ 4319.41 ms │ 3864.61 ms │ +1.12x faster │ │ QQuery 33 │ 5401.20 ms │ 4915.16 ms │ +1.10x faster │ │ QQuery 34 │ 5862.38 ms │ 5446.49 ms │ +1.08x faster │ │ QQuery 35 │ 1908.26 ms │ 1879.87 ms │ no change │ │ QQuery 36 │ 189.23 ms │ 185.66 ms │ no change │ │ QQuery 37 │ 79.13 ms │ 87.01 ms │ 1.10x slower │ │ QQuery 38 │ 115.09 ms │ 91.86 ms │ +1.25x faster │ │ QQuery 39 │ 340.85 ms │ 286.45 ms │ +1.19x faster │ │ QQuery 40 │ 40.20 ms │ 56.08 ms │ 1.40x slower │ │ QQuery 41 │ 34.76 ms │ 44.99 ms │ 1.29x slower │ │ QQuery 42 │ 31.70 ms │ 34.01 ms │ 1.07x slower │ └───────────┴─────────────┴─────────────────────────────┴───────────────┘ ``` https://github.com/apache/datafusion/pull/20318#issuecomment-3891611527 ``` Comparing HEAD and alamb_pushdown_and_arrow_58 -------------------- Benchmark tpch_mem_sf1.json -------------------- ┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Query ┃ HEAD ┃ alamb_pushdown_and_arrow_58 ┃ Change ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ QQuery 1 │ 106.36 ms │ 101.53 ms │ no change │ │ QQuery 2 │ 33.05 ms │ 31.05 ms │ +1.06x faster │ │ QQuery 3 │ 37.27 ms │ 35.90 ms │ no change │ │ QQuery 4 │ 32.24 ms │ 31.09 ms │ no change │ │ QQuery 5 │ 93.75 ms │ 81.95 ms │ +1.14x faster │ │ QQuery 6 │ 21.07 ms │ 20.25 ms │ no change │ │ QQuery 7 │ 159.62 ms │ 146.90 ms │ +1.09x faster │ │ QQuery 8 │ 42.98 ms │ 39.57 ms │ +1.09x faster │ │ QQuery 9 │ 111.61 ms │ 100.78 ms │ +1.11x faster │ │ QQuery 10 │ 67.99 ms │ 63.06 ms │ +1.08x faster │ │ QQuery 11 │ 18.50 ms │ 18.94 ms │ no change │ │ QQuery 12 │ 52.30 ms │ 54.00 ms │ no change │ │ QQuery 13 │ 49.77 ms │ 47.68 ms │ no change │ │ QQuery 14 │ 14.97 ms │ 14.12 ms │ +1.06x faster │ │ QQuery 15 │ 30.49 ms │ 29.21 ms │ no change │ │ QQuery 16 │ 28.87 ms │ 27.64 ms │ no change │ │ QQuery 17 │ 145.35 ms │ 143.02 ms │ no change │ │ QQuery 18 │ 289.85 ms │ 269.10 ms │ +1.08x faster │ │ QQuery 19 │ 40.12 ms │ 49.08 ms │ 1.22x slower │ │ QQuery 20 │ 56.72 ms │ 57.28 ms │ no change │ │ QQuery 21 │ 191.17 ms │ 189.19 ms │ no change │ │ QQuery 22 │ 22.55 ms │ 22.26 ms │ no change │ └───────────┴───────────┴─────────────────────────────┴───────────────┘ ``` </p> </details> ## How to run the benchmarks yourself You can find the query text for the ClickBench queries here - ClickBench queries: https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench/queries - ClickBench (extended) benchmarks: https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench/extended Download the data like this: ```bash cd datafusion ./benchmarks/bench.sh data clickbench_partitioned ``` Then run the queries like this (adjusting the data location): ```sql set datafusion.execution.parquet.binary_as_string = true; -- needed for ClickBench data SET datafusion.execution.target_partitions = 1; -- set to 1 to reduce variability create external table hits stored as parquet location '/home/ec2-user/datafusion/benchmarks/data/hits_partitioned'; QUERY_TEXT_HERE; SET datafusion.execution.parquet.pushdown_filters = true; QUERY_TEXT_HERE; ``` For example, to run Q10 on the partitioned ClickBench data: ```bash set datafusion.execution.parquet.binary_as_string = true; -- needed for ClickBench data SET datafusion.execution.target_partitions = 1; -- set to 1 to reduce variability create external table hits stored as parquet location '/home/ec2-user/datafusion/benchmarks/data/hits_partitioned'; -- Q10 (default configuration, no pushdown) SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; -- Q10 enable with pushdown enabled SET datafusion.execution.parquet.pushdown_filters = true; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10; ``` # Background: ## What is "Parquet filter pushdown"? In this context, "Parquet filter pushdown" refers to the ability of the query engine to push down filter predicates (e.g., WHERE clauses) **into** the Parquet file reader, allowing it to skip reading irrelevant data and thus improving query performance. As has been pointed out, this is somewhat confusing as parquet reader already implements many forms of filter pushdown (e.g. to prune files. row groups and pages) It is enabled in datafusion by setting the configuration option `datafusion.execution.parquet.pushdown_filters` to true. See [the configuration docs](https://datafusion.apache.org/user-guide/configs.html): | name | default | description | |--------|--------|--------| | datafusion.execution.parquet.pushdown_filters | false | Cell | > (reading) If true, filter expressions are be applied during the parquet decoding operation to reduce the number of rows decoded. This optimization is sometimes called “late materialization”. > > This refers to the ability of the query engine to push down filter predicates (e.g., WHERE clauses) **into** the Parquet file reader, allowing it to skip reading irrelevant data and thus improving query performance. When this setting is on, DataFusion gives the predicates to the Parquet reader using the [`ArrowPredicate`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/trait.ArrowPredicate.html) API. See the implementation in DataFusion here: https://github.com/apache/datafusion/blob/aae3e0f4a1f0b6ce92d4d9a35e1814aad011f687/datafusion/datasource-parquet/src/row_filter.rs#L105 # Specific query investigations: - [] TODO file Q10 (or 21?) investigation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
