Github user pwoody commented on the issue: https://github.com/apache/spark/pull/15835

I've updated the structure of the PR so that caching is global across instances of FileFormat, supports expiry, and reuses known filters. Here is a new benchmark to highlight the filter re-use (I flipped the results to make them easier to read):

```scala
withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
  withTempPath { path =>
    spark.range(0, 200, 1, 200)
      .write.parquet(path.getCanonicalPath)
    val benchmark = new Benchmark("Parquet partition pruning benchmark", 200)
    benchmark.addCase("Parquet partition pruning disabled") { iter =>
      withSQLConf(SQLConf.PARQUET_PARTITION_PRUNING_ENABLED.key -> "false") {
        var df = spark.read.parquet(path.getCanonicalPath).filter("id = 0")
        for (i <- 1 to 10) {
          df = df.filter(s"id < $i")
          df.collect()
        }
      }
    }
    benchmark.addCase("Parquet partition pruning enabled") { iter =>
      var df = spark.read.parquet(path.getCanonicalPath).filter("id = 0")
      for (i <- 1 to 10) {
        df = df.filter(s"id < $i")
        df.collect()
      }
    }
    benchmark.run()
  }
}
```

```
Running benchmark: Parquet partition pruning benchmark
  Running case: Parquet partition pruning disabled
  Stopped after 2 iterations, 8744 ms
  Running case: Parquet partition pruning enabled
  Stopped after 5 iterations, 2187 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 on Mac OS X 10.10.5
Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz

Parquet partition pruning benchmark:     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet partition pruning disabled            4332 / 4372          0.0    21659450.2       1.0X
Parquet partition pruning enabled              399 /  438          0.0     1995877.1      10.9X
```
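For readers following along: the "global cache with expiry" idea above can be sketched with a process-wide map keyed by file, where stale entries are recomputed after a TTL. This is a minimal, hypothetical illustration in plain Java (the class name `ExpiringCache` and its API are assumptions for this sketch, not the PR's actual implementation, which may well use a library cache such as Guava's):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of a process-wide cache with time-based expiry,
// shared across FileFormat instances; not the code from the PR.
public class ExpiringCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;
        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public ExpiringCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Return the cached value for key, recomputing it if absent or expired. */
    public V get(K key, Supplier<V> loader) {
        long now = System.currentTimeMillis();
        // compute() runs atomically per key, so concurrent readers of the
        // same key do not race to reload it.
        Entry<V> e = entries.compute(key, (k, old) -> {
            if (old != null && now - old.loadedAtMillis < ttlMillis) {
                return old;  // still fresh: reuse the known value
            }
            return new Entry<>(loader.get(), now);  // absent or expired: reload
        });
        return e.value;
    }
}
```

With this shape, repeated lookups of the same file within the TTL reuse the previously computed value instead of re-reading footers, which is the behavior the benchmark above is exercising.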