GitHub user pwoody commented on the issue:

    https://github.com/apache/spark/pull/15835
  
    I've restructured the PR so that the cache is global across instances of
    FileFormat, its entries expire, and known filters are reused. Here is a new
    benchmark that highlights the filter reuse (I flipped the results to make
    them easier to read):
    
    ```
        withSQLConf(ParquetOutputFormat.ENABLE_JOB_SUMMARY -> "true") {
          withTempPath { path =>
            spark.range(0, 200, 1, 200)
              .write.parquet(path.getCanonicalPath)
            val benchmark = new Benchmark("Parquet partition pruning benchmark", 200)
            benchmark.addCase("Parquet partition pruning disabled") { iter =>
              withSQLConf(SQLConf.PARQUET_PARTITION_PRUNING_ENABLED.key -> "false") {
                var df = spark.read.parquet(path.getCanonicalPath).filter("id = 0")
                for (i <- 1 to 10) {
                  df = df.filter(s"id < $i")
                  df.collect()
                }
              }
            }
            benchmark.addCase("Parquet partition pruning enabled") { iter =>
              var df = spark.read.parquet(path.getCanonicalPath).filter("id = 0")
              for (i <- 1 to 10) {
                df = df.filter(s"id < $i")
                df.collect()
              }
            }
            benchmark.run()
          }
        }
    ```
    
    
    ```
    Running benchmark: Parquet partition pruning benchmark
      Running case: Parquet partition pruning disabled
      Stopped after 2 iterations, 8744 ms
      Running case: Parquet partition pruning enabled
      Stopped after 5 iterations, 2187 ms
    
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26 on Mac OS X 10.10.5
    Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz
    
    Parquet partition pruning benchmark:     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    Parquet partition pruning disabled            4332 / 4372          0.0    21659450.2       1.0X
    Parquet partition pruning enabled              399 /  438          0.0     1995877.1      10.9X
    ```
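
    For anyone trying to picture the "global, expiring, filter-reusing" cache
    described above, here is a rough standalone sketch of the idea. It is not
    the code in the PR: `PrunedFilesCache`, `Entry`, `put`, `lookup`, and the
    use of plain strings for filters and file names are all hypothetical, and
    timestamps are passed in explicitly to keep the example deterministic.

    ```scala
    import scala.collection.concurrent.TrieMap

    // Hypothetical sketch only; none of these names come from the PR.
    object PrunedFilesCache {
      // Files that survived pruning for some set of filters, plus an expiry time.
      final case class Entry(files: Seq[String], expiresAtMillis: Long)

      // A single map shared by all callers, keyed by (table path, filters applied).
      private val cache = TrieMap.empty[(String, Set[String]), Entry]

      def put(path: String, filters: Set[String], files: Seq[String],
              ttlMillis: Long, nowMillis: Long): Unit = {
        cache.put((path, filters), Entry(files, nowMillis + ttlMillis))
      }

      // Returns the cached files for the largest known subset of `filters`,
      // together with the filters that still have to be evaluated on them.
      def lookup(path: String, filters: Set[String],
                 nowMillis: Long): Option[(Seq[String], Set[String])] = {
        // Evict entries whose time-to-live has passed.
        cache.toSeq.foreach { case (key, entry) =>
          if (entry.expiresAtMillis <= nowMillis) { cache.remove(key, entry) }
        }
        // An entry is reusable when its filters are a subset of the requested ones.
        val candidates = cache.toSeq.collect {
          case ((p, known), entry) if p == path && known.subsetOf(filters) => (known, entry)
        }
        if (candidates.isEmpty) None
        else {
          val (known, entry) = candidates.maxBy(_._1.size)
          Some((entry.files, filters -- known))
        }
      }
    }
    ```

    In the benchmark above each iteration adds one more `id < $i` filter on top
    of the previous ones, so under this scheme a lookup would find the files
    kept by the earlier filters and only the newly added filter would need to
    be evaluated against them.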

