GitHub user xwu0226 opened a pull request:
https://github.com/apache/spark/pull/16156
[SPARK-18539][SQL]: tolerate pushed-down filter on non-existing parquet
columns
## What changes were proposed in this pull request?
When `spark.sql.parquet.filterPushdown` is on, Spark SQL's Parquet reader
pushes filters down to the Parquet file when it creates the reader, so that it
starts with only the blocks that pass the filters. However, when the Parquet
file does not contain the predicate column(s), parquet-mr throws an exception
complaining that the filter column does not exist. This issue will be fixed in
parquet-mr 1.9, but Spark 2.1 is still on Parquet 1.8.
This PR tolerates the exception thrown by parquet-mr and simply returns all
the blocks of the current Parquet file to the newly created Spark SQL Parquet
reader. The filters are applied again anyway by a later physical plan
operator, as the `Filter` node in the following example physical plan shows:
```
== Physical Plan ==
*Project [a#2805, b#2806, c#2807]
+- *Filter ((isnotnull(a#2805) && isnull(c#2807)) && (a#2805 < 2))
   +- *FileScan parquet [a#2805,b#2806,c#2807] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/Users/xinwu/spark/target/tmp/spark-ed6f0c12-6494-4ac5-b485-5b986ef475cc], PartitionFilters: [], PushedFilters: [IsNotNull(a), IsNull(c), LessThan(a,2)], ReadSchema: struct<a:int,b:string,c:int>
```
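The fallback described above can be illustrated with a small toy model in plain Python. This is only a sketch of the idea, not the patch itself: `RowGroup`, `Predicate`, `filter_row_groups`, `tolerant_row_groups` and `scan` are hypothetical names that do not exist in Spark or parquet-mr, the error message is illustrative, and real row-group filtering works on column statistics rather than by testing every row.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]


@dataclass
class RowGroup:
    """Toy stand-in for one block of a Parquet file (hypothetical type)."""
    columns: frozenset  # column names physically present in the file
    rows: List[Row]


@dataclass
class Predicate:
    """Toy stand-in for a pushed-down filter on a single column."""
    column: str
    test: Callable[[Optional[object]], bool]


def filter_row_groups(pred: Predicate, groups: List[RowGroup]) -> List[RowGroup]:
    """Strict push-down: fails if the predicate column is not in the file."""
    kept = []
    for group in groups:
        if pred.column not in group.columns:
            # Models the reader rejecting a filter on a missing column.
            raise ValueError(f"filter column {pred.column!r} not found in file")
        if any(pred.test(row[pred.column]) for row in group.rows):
            kept.append(group)
    return kept


def tolerant_row_groups(pred: Predicate, groups: List[RowGroup]) -> List[RowGroup]:
    """Fallback: if push-down fails, hand back every row group unfiltered."""
    try:
        return filter_row_groups(pred, groups)
    except ValueError:
        return list(groups)


def scan(pred: Predicate, groups: List[RowGroup], read_schema: List[str]) -> List[Row]:
    """Reads the surviving groups, fills missing columns with None, and
    re-applies the predicate, like the Filter operator above the scan."""
    result = []
    for group in tolerant_row_groups(pred, groups):
        for raw in group.rows:
            row = {name: raw.get(name) for name in read_schema}
            if pred.test(row[pred.column]):
                result.append(row)
    return result


# The file only has columns a and b; the read schema also asks for c.
groups = [RowGroup(frozenset({"a", "b"}), [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])]
is_null_c = Predicate("c", lambda value: value is None)
rows = scan(is_null_c, groups, ["a", "b", "c"])
```

In this model the strict `filter_row_groups` call fails for `is_null_c`, while `scan` still returns both rows with `c` set to `None`. Correctness is preserved because the predicate is evaluated again after the read; the only cost of the fallback is that no blocks are skipped for that file.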
## How was this patch tested?
A unit test case is added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xwu0226/spark SPARK-18539
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16156.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16156
----
commit 20401598b7661cf9bd18f2b4dbd977c7a5b76832
Author: Xin Wu <[email protected]>
Date: 2016-12-05T21:18:12Z
SPARK-18539: fix filtered on non-existing parquet column
commit 096ab18887c40761eb7ba79e9c406fe8ca6ce7c0
Author: Xin Wu <[email protected]>
Date: 2016-12-05T21:24:13Z
SPARK-18539: update testcases
----