GitHub user xwu0226 opened a pull request: https://github.com/apache/spark/pull/16156
[SPARK-18539][SQL] Tolerate pushed-down filters on non-existent Parquet columns

## What changes were proposed in this pull request?

When `spark.sql.parquet.filterPushdown` is enabled, Spark SQL's Parquet reader pushes filters down to the Parquet file when creating the reader, so that reading starts with only the filtered blocks. However, when the Parquet file does not contain the predicate column(s), parquet-mr throws an exception complaining that the filter column does not exist. This issue is fixed in parquet-mr 1.9, but Spark 2.1 is still on parquet-mr 1.8. This PR tolerates such exceptions thrown by parquet-mr and simply returns all blocks of the current Parquet file to the Spark SQL Parquet reader. The filters are applied again anyway by a later physical plan operation, as the following example physical plan shows:

```
== Physical Plan ==
*Project [a#2805, b#2806, c#2807]
+- *Filter ((isnotnull(a#2805) && isnull(c#2807)) && (a#2805 < 2))
   +- *FileScan parquet [a#2805,b#2806,c#2807] Batched: true, Format: ParquetFormat, Location: InMemoryFileIndex[file:/Users/xinwu/spark/target/tmp/spark-ed6f0c12-6494-4ac5-b485-5b986ef475cc], PartitionFilters: [], PushedFilters: [IsNotNull(a), IsNull(c), LessThan(a,2)], ReadSchema: struct<a:int,b:string,c:int>
```

## How was this patch tested?

A unit test case was added.
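The fallback behavior described above can be sketched in plain Python (a hypothetical illustration with made-up names, not Spark's or parquet-mr's actual API): if the pushed-down predicate references a column missing from a file's schema, skip row-group pruning and return every row group, relying on the downstream `Filter` operator to re-apply the predicate.

```python
# Hypothetical sketch of the tolerance logic in this PR. Row groups are
# modeled as dicts of column name -> list of values, standing in for
# real Parquet row-group statistics.

def select_row_groups(row_groups, file_schema, predicate_column, predicate):
    """Return the row groups to read from one Parquet file."""
    if predicate_column not in file_schema:
        # Tolerate the missing column: return all blocks and let the
        # downstream Filter operator re-evaluate the predicate, instead
        # of failing the scan as parquet-mr 1.8 would.
        return list(row_groups)
    # Otherwise, keep only row groups in which some value can match.
    return [rg for rg in row_groups
            if any(predicate(v) for v in rg[predicate_column])]


# File written with schema (a, b, c): pushdown prunes row groups.
groups_abc = [{"a": [1, 2], "b": ["x", "y"], "c": [None, 3]},
              {"a": [5, 6], "b": ["z", "w"], "c": [7, 8]}]
kept = select_row_groups(groups_abc, {"a", "b", "c"}, "a", lambda v: v < 2)
print(len(kept))  # only the first row group can satisfy a < 2

# Older file written with schema (a, b): column c is missing, so all
# row groups are returned rather than raising an exception.
groups_ab = [{"a": [1, 2], "b": ["x", "y"]}]
kept_all = select_row_groups(groups_ab, {"a", "b"}, "c", lambda v: v is None)
print(len(kept_all))
```

This mirrors the PR's approach of catching the error for a missing predicate column and falling back to reading all blocks, which is safe because the physical plan's `Filter` node evaluates the same predicates again.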
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xwu0226/spark SPARK-18539

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16156.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16156

----

commit 20401598b7661cf9bd18f2b4dbd977c7a5b76832
Author: Xin Wu <xi...@us.ibm.com>
Date: 2016-12-05T21:18:12Z

    SPARK-18539: fix filtered on non-existing parquet column

commit 096ab18887c40761eb7ba79e9c406fe8ca6ce7c0
Author: Xin Wu <xi...@us.ibm.com>
Date: 2016-12-05T21:24:13Z

    SPARK-18539: update testcases