GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/13701

    [SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader

    ## What changes were proposed in this pull request?
    
    The base class `SpecificParquetRecordReaderBase`, used by the vectorized
    Parquet reader, tries to get pushed-down filters from the given
    configuration. These pushed-down filters are used for RowGroups-level
    filtering. However, we never set the filters to push down in the
    configuration, so they are not actually pushed down for RowGroups-level
    filtering. This patch fixes that by setting up the pushed-down filters in
    the configuration for the reader.
    
    The benchmark, which excludes the time spent writing the Parquet file:
    
        test("Benchmark for Parquet") {
          // Int shift counts are masked to 5 bits, so this evaluates to 1 << 18 = 262144.
          val N = 1 << 50
          withParquetTable((0 until N).map(i => (101, i)), "t") {
            val benchmark = new Benchmark("Parquet reader", N)
            benchmark.addCase("reading Parquet file", 10) { iter =>
              sql("SELECT _1 FROM t where t._1 < 100").collect()
            }
            benchmark.run()
          }
        }
    
    By default, `withParquetTable` runs tests for both the vectorized and
    non-vectorized readers. I only let it run the vectorized reader.
    
    After this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 3.13.0-57-generic
        Westmere E56xx/L56xx/X56xx (Nehalem-C)
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                            76 /   88          3.4         291.0       1.0X
    
    Before this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 3.13.0-57-generic
        Westmere E56xx/L56xx/X56xx (Nehalem-C)
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                            81 /   91          3.2         310.2       1.0X
    
    Next, I ran the same benchmark for the non-pushdown case, with the pushdown
    configuration disabled.
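    
    The exact setting changed for this run is not shown in this description.
    As a rough sketch only, one way to disable pushdown around the benchmark
    query is shown below. It assumes Spark's `spark.sql.parquet.filterPushdown`
    option and the `withSQLConf` helper from Spark's SQL test utilities, and it
    is a fragment meant to sit inside the benchmark test body rather than a
    standalone program.
    
    ```scala
    // Assumption: the suite mixes in Spark's SQLTestUtils, which provides withSQLConf.
    // The option is restored to its previous value when the block exits.
    withSQLConf("spark.sql.parquet.filterPushdown" -> "false") {
      sql("SELECT _1 FROM t where t._1 < 100").collect()
    }
    ```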
    
    After this patch:
    
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                            80 /   95          3.3         306.5       1.0X
    
    Before this patch:
    
        Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Parquet file                            80 /  103          3.3         306.7       1.0X
    
    For the non-pushdown case, the results are nearly identical before and
    after, so I think this patch doesn't affect the normal code path.
    
    I've manually printed `totalRowCount` in `SpecificParquetRecordReaderBase`
    to see whether this patch actually filters the row groups. When running the
    above benchmark:
    
    After this patch:
        `totalRowCount = 0`
    
    Before this patch:
        `totalRowCount = 131072`
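    
    The drop from 131072 to 0 is what statistics-based row-group pruning
    would predict for this data, since every row has `_1 == 101` and the
    filter is `_1 < 100`. The sketch below is a simplified model of that
    idea, not Parquet's or Spark's actual code: `RowGroupStats`,
    `mightMatchLessThan`, and the single 131072-row group are hypothetical
    names and assumptions made for illustration.
    
    ```scala
    // Hypothetical model of row-group pruning; these names are not Parquet's API.
    final case class RowGroupStats(min: Int, max: Int, rowCount: Long)
    
    object RowGroupPruning {
      // A row group can contain rows with value < bound only if its minimum is below bound.
      def mightMatchLessThan(stats: RowGroupStats, bound: Int): Boolean =
        stats.min < bound
    
      // Sum the row counts of the row groups that survive the (optional) pushed-down filter.
      def totalRowCount(groups: Seq[RowGroupStats], pushedBound: Option[Int]): Long =
        pushedBound match {
          case Some(bound) => groups.filter(mightMatchLessThan(_, bound)).map(_.rowCount).sum
          case None        => groups.map(_.rowCount).sum // no filter in the configuration
        }
    
      def main(args: Array[String]): Unit = {
        // Assumption: one row group whose column `_1` is always 101 (min == max == 101).
        val groups = Seq(RowGroupStats(min = 101, max = 101, rowCount = 131072L))
        println(totalRowCount(groups, pushedBound = None))      // filter never set up
        println(totalRowCount(groups, pushedBound = Some(100))) // filter `_1 < 100` pushed down
      }
    }
    ```
    
    With no filter in the configuration, the model counts every row; with the
    filter present, the only row group is skipped because its minimum is not
    below 100.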
    
    
    ## How was this patch tested?
    Existing tests should pass.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 vectorized-reader-push-down-filter2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13701.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13701
    
----
commit 5687a3b5527817c809244305468bfe4968bedcec
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-05-28T05:03:06Z

    Try to push down filter at RowGroups level for parquet reader.

commit 077f7f8813a76d38c8a6d898ec54e401c91b6014
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-06-09T21:19:47Z

    Merge remote-tracking branch 'upstream/master' into vectorized-reader-push-down-filter
    
    Conflicts:
        sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala

commit 97ccacfca1f7a039bc7bf7b8a4f8f975deb70197
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-06-14T07:22:53Z

    Merge remote-tracking branch 'upstream/master' into vectorized-reader-push-down-filter

----


