GitHub user sameeragarwal opened a pull request:
https://github.com/apache/spark/pull/11749
[SPARK-13922][SQL] Filter rows with null attributes in parquet vectorized reader
## What changes were proposed in this pull request?
Many SQL operators do not need to read `null` values to produce correct results. Currently this is achieved by performing `isNotNull` checks
(for all relevant columns) on a per-row basis. Pushing these null filters down into the
Parquet vectorized reader should bring considerable benefits, especially when
the underlying data contains no nulls at all or contains only nulls.
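
To make the idea concrete, here is a minimal, illustrative sketch (not the code in this patch; the `ColumnBatch` class and method names are invented for the example). It models a string column as a values array plus a null mask, the way a columnar batch exposes it, and contrasts the per-row `isNotNull` check with a filter applied once over the batch's null mask, which is what the benchmarks below measure.

```scala
// Illustrative sketch only -- not the actual patch.
object NullFilterSketch {

  // Hypothetical stand-in for one column of a columnar batch:
  // a values array plus a per-row null mask.
  final case class ColumnBatch(values: Array[String], isNull: Array[Boolean])

  // Status quo: each operator re-checks nulls row by row (conceptually `isNotNull(col)`).
  def perRowFilter(batch: ColumnBatch): Seq[String] =
    (0 until batch.values.length).collect {
      case i if !batch.isNull(i) => batch.values(i)
    }

  // With the filter pushed down, the null mask is consulted once while the batch is
  // materialized, so downstream operators only ever see non-null rows. When a batch
  // contains no nulls (or only nulls) the check degenerates to a pass-through or a skip.
  def pushedDownFilter(batch: ColumnBatch): Array[String] = {
    val out = Array.newBuilder[String]
    var i = 0
    while (i < batch.values.length) { // tight loop, in the spirit of the vectorized reader
      if (!batch.isNull(i)) out += batch.values(i)
      i += 1
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val batch = ColumnBatch(Array("a", null, "c"), Array(false, true, false))
    println(pushedDownFilter(batch).mkString(",")) // prints: a,c
  }
}
```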
## How was this patch tested?
=======================
Fraction of NULLs: 0
=======================
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
String with Nulls Scan:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                   1164 / 1333          9.0         111.0       1.0X
PR Vectorized                             809 /  882         13.0          77.1       1.4X
PR Vectorized (Null Filtering)            723 /  800         14.5          69.0       1.6X
=======================
Fraction of NULLs: 0.5
=======================
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
String with Nulls Scan:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    983 / 1001         10.7          93.8       1.0X
PR Vectorized                             699 /  728         15.0          66.7       1.4X
PR Vectorized (Null Filtering)            722 /  746         14.5          68.9       1.4X
=======================
Fraction of NULLs: 0.95
=======================
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
String with Nulls Scan:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    332 /  343         31.6          31.6       1.0X
PR Vectorized                             177 /  180         59.1          16.9       1.9X
PR Vectorized (Null Filtering)            168 /  175         62.4          16.0       2.0X
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sameeragarwal/spark perf-testing
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11749.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11749
----
commit adf9f935d135e6c7f4e94d8a17707577d0cc86f5
Author: Sameer Agarwal <[email protected]>
Date: 2016-03-11T00:26:18Z
Filter null columns in ColumnarBatch
commit af217fe1a5dc4dbdf015e850de67c2b707f7e541
Author: Sameer Agarwal <[email protected]>
Date: 2016-03-15T06:29:47Z
Parquet Read Benchmark
----