GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/11749

    [SPARK-13922][SQL] Filter rows with null attributes in parquet vectorized reader

    ## What changes were proposed in this pull request?
    
    It's common for many SQL operators to not care about reading `null` values
    for correctness. Currently, this is achieved by performing `isNotNull` checks
    (for all relevant columns) on a per-row basis. Pushing these null filters into
    the parquet vectorized reader should bring considerable benefits, especially
    when the underlying data contains no nulls, or contains all nulls.
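
    The idea can be illustrated with a minimal, self-contained sketch (this is
    not Spark's `ColumnarBatch` API; the class and field names below are
    hypothetical): instead of every operator calling `isNotNull` per row, the
    reader computes the surviving row ids once per batch, so downstream
    iteration touches only non-null positions.

    ```java
    // Toy model of batch-level null filtering vs. per-row isNotNull checks.
    // NOT Spark internals -- an illustrative sketch only.
    public class NullFilterSketch {
        public static void main(String[] args) {
            // A single nullable column, materialized as one "batch".
            Integer[] col = {1, null, 3, null, 5};

            // Batch approach: compute surviving row ids once, up front.
            int[] rowIds = new int[col.length];
            int numRows = 0;
            for (int i = 0; i < col.length; i++) {
                if (col[i] != null) rowIds[numRows++] = i;
            }

            // Downstream operators iterate only the filtered positions,
            // with no per-row null check in the hot loop.
            long sum = 0;
            for (int i = 0; i < numRows; i++) sum += col[rowIds[i]];
            System.out.println(numRows + " rows, sum=" + sum);
        }
    }
    ```

    When a batch contains no nulls (or all nulls), the filter pass degenerates
    to a trivial copy (or a skip of the whole batch), which is where the
    benchmark below shows the largest gains.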
    
    ## How was this patch tested?
    
            =======================
            Fraction of NULLs: 0
            =======================
    
            Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
            String with Nulls Scan:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                   1164 / 1333          9.0        111.0       1.0X
            PR Vectorized                             809 /  882         13.0         77.1       1.4X
            PR Vectorized (Null Filtering)            723 /  800         14.5         69.0       1.6X
    
            =======================
            Fraction of NULLs: 0.5
            =======================
    
            Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
            String with Nulls Scan:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                    983 / 1001         10.7         93.8       1.0X
            PR Vectorized                             699 /  728         15.0         66.7       1.4X
            PR Vectorized (Null Filtering)            722 /  746         14.5         68.9       1.4X
    
            =======================
            Fraction of NULLs: 0.95
            =======================
    
            Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
            String with Nulls Scan:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
            -------------------------------------------------------------------------------------------
            SQL Parquet Vectorized                    332 /  343         31.6         31.6       1.0X
            PR Vectorized                             177 /  180         59.1         16.9       1.9X
            PR Vectorized (Null Filtering)            168 /  175         62.4         16.0       2.0X

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark perf-testing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11749.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11749
    
----
commit adf9f935d135e6c7f4e94d8a17707577d0cc86f5
Author: Sameer Agarwal <[email protected]>
Date:   2016-03-11T00:26:18Z

    Filter null columns in ColumnarBatch

commit af217fe1a5dc4dbdf015e850de67c2b707f7e541
Author: Sameer Agarwal <[email protected]>
Date:   2016-03-15T06:29:47Z

    Parquet Read Benchmark

----

