GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/13701
[SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader
## What changes were proposed in this pull request?
The base class `SpecificParquetRecordReaderBase`, used by the vectorized
Parquet reader, tries to get pushed-down filters from the given
configuration. These pushed-down filters are used for RowGroups-level
filtering. However, we never set the filters in the configuration, so they
are not actually pushed down and no RowGroups-level filtering happens. This
patch fixes that by setting the filters to push down in the configuration
for the reader.
The benchmark that excludes the time of writing Parquet file:
test("Benchmark for Parquet") {
  val N = 1 << 50
  withParquetTable((0 until N).map(i => (101, i)), "t") {
    val benchmark = new Benchmark("Parquet reader", N)
    benchmark.addCase("reading Parquet file", 10) { iter =>
      sql("SELECT _1 FROM t where t._1 < 100").collect()
    }
    benchmark.run()
  }
}
By default, `withParquetTable` runs tests for both the vectorized and
non-vectorized readers. I only let it run the vectorized reader.
After this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 3.13.0-57-generic
Westmere E56xx/L56xx/X56xx (Nehalem-C)
Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Parquet file                               76 / 88          3.4         291.0       1.0X
Before this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17 on Linux 3.13.0-57-generic
Westmere E56xx/L56xx/X56xx (Nehalem-C)
Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Parquet file                               81 / 91          3.2         310.2       1.0X
Next, I ran the benchmark for the non-pushdown case, using the same
benchmark code but with the pushdown configuration disabled.
After this patch:
Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Parquet file                               80 / 95          3.3         306.5       1.0X
Before this patch:
Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Parquet file                              80 / 103          3.3         306.7       1.0X
For the non-pushdown case, the results suggest that this patch doesn't
affect the normal code path.
I've manually printed the `totalRowCount` in
`SpecificParquetRecordReaderBase` to see whether this patch actually filters
the row-groups. When running the above benchmark:
After this patch:
`totalRowCount = 0`
Before this patch:
`totalRowCount = 131072`
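
To illustrate why the row count can drop to zero, here is a minimal sketch,
assuming row groups are pruned by comparing the pushed-down predicate against
per-group min/max statistics. `RowGroupStats` and `RowGroupFilterSketch` are
hypothetical names, and the single row group of 131072 rows is an assumed
layout for illustration; this is a simplified model, not the Parquet or Spark
API.

```scala
// Simplified model of RowGroups-level filtering (hypothetical types, not Parquet's classes).
final case class RowGroupStats(min: Int, max: Int, rowCount: Long)

object RowGroupFilterSketch {
  // For a predicate `col < bound`, a row group can only contain matches
  // if its minimum value is below the bound.
  def mayContainLessThan(stats: RowGroupStats, bound: Int): Boolean =
    stats.min < bound

  // Sum the rows of the row groups that survive filtering.
  // `pushedBound = None` models the case where no filter reaches the reader.
  def totalRowCount(groups: Seq[RowGroupStats], pushedBound: Option[Int]): Long =
    groups
      .filter(g => pushedBound.forall(b => mayContainLessThan(g, b)))
      .map(_.rowCount)
      .sum

  def main(args: Array[String]): Unit = {
    // Assumed layout: one row group whose column `_1` is always 101,
    // as in the benchmark data above.
    val groups = Seq(RowGroupStats(min = 101, max = 101, rowCount = 131072L))
    println(totalRowCount(groups, pushedBound = None))      // prints 131072
    println(totalRowCount(groups, pushedBound = Some(100))) // prints 0
  }
}
```

Because every value of `_1` is 101, the predicate `_1 < 100` can never match,
so under this model the whole row group is skipped once the filter is visible
to the reader.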
## How was this patch tested?
Existing tests should pass.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 vectorized-reader-push-down-filter2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13701.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13701
----
commit 5687a3b5527817c809244305468bfe4968bedcec
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-05-28T05:03:06Z
Try to push down filter at RowGroups level for parquet reader.
commit 077f7f8813a76d38c8a6d898ec54e401c91b6014
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-06-09T21:19:47Z
Merge remote-tracking branch 'upstream/master' into vectorized-reader-push-down-filter
Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
commit 97ccacfca1f7a039bc7bf7b8a4f8f975deb70197
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-06-14T07:22:53Z
Merge remote-tracking branch 'upstream/master' into vectorized-reader-push-down-filter
----