GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/14067
[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and
outer name are the same in Parquet
## What changes were proposed in this pull request?
Currently, if there is a schema as below:
```
root
 |-- _1: struct (nullable = true)
 |    |-- _1: integer (nullable = true)
```
and if we execute the code below:
```scala
df.filter("_1 IS NOT NULL").count()
```
Spark pushes down a filter even though the filter is applied to a
`StructType` column. (If my understanding is correct, Spark does not push down
filters on struct columns.)
The reason is, `ParquetFilters.getFieldMap` produces results below:
```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```
and then, because both entries share the key `_1`, converting them to a `Map`
keeps only the last one:
```
(_1,IntegerType)
```
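The collapse can be reproduced without Spark. The sketch below is only an illustration: `DataType`, `StructField`, `StructType`, `IntegerType` and `fieldPairs` are simplified stand-ins defined here, not the real Spark classes or the actual `getFieldMap` implementation.

```scala
object FieldMapSketch {
  // Simplified stand-ins for Spark's type classes (hypothetical, for illustration only).
  sealed trait DataType
  case object IntegerType extends DataType
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField]) extends DataType

  // Flattens a schema into (name, type) pairs, recursing into nested structs.
  def fieldPairs(dt: DataType): Seq[(String, DataType)] = dt match {
    case StructType(fields) =>
      fields.flatMap(f => (f.name -> f.dataType) +: fieldPairs(f.dataType))
    case _ => Seq.empty
  }

  // Outer struct column `_1` containing an inner integer column `_1`.
  val inner: StructType = StructType(Seq(StructField("_1", IntegerType)))
  val schema: StructType = StructType(Seq(StructField("_1", inner)))

  // Two pairs with the same key: the struct entry first, the integer entry second.
  val pairs: Seq[(String, DataType)] = fieldPairs(schema)

  // `toMap` keeps the last value for a duplicated key, so the struct entry is lost.
  val fieldMap: Map[String, DataType] = pairs.toMap
}
```

Here `FieldMapSketch.fieldMap("_1")` is `IntegerType`, so the outer struct column is no longer visible in the map.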
Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, a filter is
pushed down for `_1`, which `ParquetFilters` treats as `IntegerType`. However,
the column is actually a `StructType`.
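To make the `lift` step concrete, here is a rough sketch of the lookup pattern. The names `makeFilter` and `build` and the string-based "filter" are assumptions made up for this example; the real code builds Parquet predicates, which are not shown here.

```scala
object LiftSketch {
  // Simplified stand-ins for Spark's types (hypothetical, for illustration only).
  sealed trait DataType
  case object IntegerType extends DataType
  case object StructType extends DataType

  // Hypothetical filter builder, defined only for types that can be pushed down.
  val makeFilter: PartialFunction[DataType, (String, Any) => String] = {
    case IntegerType => (name, value) => s"eq($name, $value)"
  }

  // `lift` turns the partial function into one returning an Option, so a type
  // without a case yields None and no filter is pushed down.
  def build(dataTypeOf: Map[String, DataType], name: String, value: Any): Option[String] =
    makeFilter.lift(dataTypeOf(name)).map(_(name, value))
}
```

With the collapsed map, `LiftSketch.build(Map("_1" -> LiftSketch.IntegerType), "_1", 1)` yields a filter, whereas a map that kept the real type, `Map("_1" -> LiftSketch.StructType)`, yields `None`.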
As a result, Parquet filter2 fails, and
```
df.filter("_1 IS NOT NULL").count()
```
produces 0.
This PR prevents this by pre-checking supported types for Parquet filter
push down.
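The idea of pre-checking can be sketched roughly as follows. This is not the code in this patch: `isSupported`, `pushedFilter` and the set of supported types are hypothetical and chosen only to show the shape of the check.

```scala
object PushDownCheckSketch {
  // Simplified stand-ins for Spark's types (hypothetical, for illustration only).
  sealed trait DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  final case class StructType(fieldNames: Seq[String]) extends DataType

  // Hypothetical whitelist: only flat, atomic types are eligible for push down.
  def isSupported(dt: DataType): Boolean = dt match {
    case IntegerType | StringType => true
    case _: StructType => false
  }

  // Checks the column's actual type before building any filter, so a struct
  // column never reaches the name-based lookup.
  def pushedFilter(name: String, actualType: DataType): Option[String] =
    if (isSupported(actualType)) Some(s"$name IS NOT NULL") else None
}
```

Under these assumptions, `pushedFilter("_1", StructType(Seq("_1")))` returns `None`, so the `IS NOT NULL` predicate on the struct column would be evaluated by Spark rather than by Parquet.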
## How was this patch tested?
Unit test in `ParquetFilterSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-16371
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14067.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14067
----
commit 54b834838c1855fcea0d108b19b89119fe351618
Author: hyukjinkwon <[email protected]>
Date: 2016-07-06T10:00:53Z
Do not push down filters incorrectly when inner name and outer name are the
same
----