GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/17680
[SPARK-20364][SQL] Support Parquet predicate pushdown on columns with dots
## What changes were proposed in this pull request?
Currently, if there are dots in the column name, predicate pushdown in
Parquet seems to fail, silently returning an empty result.
**With dots**
```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```
```
+--------+
|col.dots|
+--------+
+--------+
```
**Without dots**
```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
```
```
+-------+
|coldots|
+-------+
| 1|
+-------+
```
It seems that when a column name contains a dot, Parquet's `FilterApi`
splits the field name on the dot, producing a `ColumnPath` with multiple
path segments, whereas the actual column name is the single name `col.dots`.
(See
[FilterApi.java#L71](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71),
which calls
[ColumnPath.java#L44](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44).)
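To make the cause concrete, below is a minimal sketch against the
`parquet-common` 1.8.2 API (only `ColumnPath.fromDotString` and
`ColumnPath.get` are involved; the outputs in the comments are what I expect):
```scala
import org.apache.parquet.hadoop.metadata.ColumnPath

// fromDotString treats the dot as a path separator and yields two segments,
// so the pushed-down filter looks for a field `dots` nested inside `col`:
val split = ColumnPath.fromDotString("col.dots")
println(split.toArray.mkString(" / "))   // col / dots

// get keeps the whole name as a single segment, matching the actual column:
val single = ColumnPath.get("col.dots")
println(single.toArray.mkString(" / "))  // col.dots
```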
I tried to come up with ways to resolve this and came up with the two below:
- One is simply not to push down filters when there are dots in column
names, so that Spark reads everything and applies the filter on its own side.
- The other is to create Spark's own version of `FilterApi` for those columns
(the Parquet class appears to be final), so that a single column path is
always used on the Spark side (this seems hacky). This works because we do
not currently push down nested columns, so the field name can be obtained via
`ColumnPath.get` instead of `ColumnPath.fromDotString`.
This PR proposes the latter approach because I think we need to be sure
that it passes the tests; a sketch of the idea follows.
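This is a rough sketch only, not the exact code in this PR; the helper name
`DottedColumns` is hypothetical. Since the constructors of the `Operators`
column classes are package-private, such a helper would have to live in the
`org.apache.parquet.filter2.predicate` package:
```scala
package org.apache.parquet.filter2.predicate

import org.apache.parquet.hadoop.metadata.ColumnPath

// Hypothetical helper: being in this package grants access to the
// package-private constructors of Operators.IntColumn and friends.
object DottedColumns {
  // Unlike FilterApi.intColumn, which goes through ColumnPath.fromDotString,
  // these keep a dotted name such as "col.dots" as a single path segment.
  def intColumn(name: String): Operators.IntColumn =
    new Operators.IntColumn(ColumnPath.get(name))

  def longColumn(name: String): Operators.LongColumn =
    new Operators.LongColumn(ColumnPath.get(name))
}
```
Spark-side code that builds Parquet filter predicates would then call these
helpers instead of `FilterApi.intColumn` and the like.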
**After**
```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```
```
+--------+
|col.dots|
+--------+
| 1|
+--------+
```
## How was this patch tested?
Existing tests should cover this, and some additional tests were added in
`ParquetFilterSuite.scala`. I ran the related tests manually, and Jenkins
will run the full suite; a sketch of the added test is shown below.
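For reference, here is a sketch of the kind of test added, following the
existing `ParquetFilterSuite`/`SQLTestUtils` conventions (`withTempPath`,
`checkAnswer`); the exact test body in the PR may differ:
```scala
test("SPARK-20364: filter pushdown on columns containing dots") {
  withTempPath { path =>
    // Write a column whose name contains a dot and filter on it. Before this
    // patch, the pushed-down filter matched nothing and the result was empty.
    Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
    val readBack = spark.read.parquet(path.getAbsolutePath)
      .where("`col.dots` IS NOT NULL")
    checkAnswer(readBack, Row(1))
  }
}
```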
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-20364
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17680.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17680
----
commit 297c70bd47cb859d343345849776bd7c21396078
Author: hyukjinkwon <[email protected]>
Date: 2017-04-19T06:24:15Z
Parquet predicate pushdown on columns with dots return empty results
----