GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/16184
[SPARK-18753][SQL] Keep pushed-down null literal as a filter in Spark-side
post-filter for FileFormat datasources
## What changes were proposed in this pull request?
Currently, `FileSourceStrategy` does not handle the case where the
pushed-down filter is `Literal(null)`, and removes it from the Spark-side
post-filter.
For example, the code below:
```scala
val ds = Seq(Tuple1(Some(true)), Tuple1(None), Tuple1(Some(false))).toDS()
ds.filter($"_1" === "true").explain(true)
```
shows that the `null` is kept properly:
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- LocalRelation [_1#17]
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#17 as double) = cast(true as double))
+- LocalRelation [_1#17]
== Optimized Logical Plan ==
Filter (isnotnull(_1#17) && null)
+- LocalRelation [_1#17]
== Physical Plan ==
*Filter (isnotnull(_1#17) && null)
+- LocalTableScan [_1#17]
```
However, when we read it back from Parquet,
```scala
ds.write.parquet(path)
spark.read.parquet(path).filter($"_1" === "true").explain(true)
```
the `null` is removed from the post-filter:
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- Relation[_1#11] parquet
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#11 as double) = cast(true as double))
+- Relation[_1#11] parquet
== Optimized Logical Plan ==
Filter (isnotnull(_1#11) && null)
+- Relation[_1#11] parquet
== Physical Plan ==
*Project [_1#11]
+- *Filter isnotnull(_1#11)
+- *FileScan parquet [_1#11] Batched: true, Format: ParquetFormat,
Location: InMemoryFileIndex[file:/tmp/testfile], PartitionFilters: [null],
PushedFilters: [IsNotNull(_1)], ReadSchema: struct<_1:boolean>
```
This PR fixes `FileSourceStrategy` so that the `null` filter is kept properly. In more detail, in
```scala
val partitionKeyFilters =
  ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
```
the `null` ends up in `partitionKeyFilters`, because a `Literal` never has
`children`, so its `references` set is empty, and the empty set is a subset of
any `partitionSet`.
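The set-theoretic reason can be checked with plain Scala collections (a minimal illustration with a made-up `partitionSet`; these are not the Catalyst classes):

```scala
// Literal(null) has no children, so its reference set is empty.
val references: Set[String] = Set.empty

// Hypothetical partition columns, for illustration only.
val partitionSet = Set("year", "month")

// The empty set is a subset of every set, so the null literal passes the
// partition-filter test and lands in partitionKeyFilters.
val isTreatedAsPartitionFilter = references.subsetOf(partitionSet)
```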
And then in
```scala
val afterScanFilters = filterSet -- partitionKeyFilters
```
the `null` is always removed from the post-filter. So, if a filter's referenced
fields are empty, it should be applied to both partition columns and data
columns.
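Putting the two steps together, the pre-fix loss and the intended post-fix behavior can be sketched with plain Scala sets (a simplified model with made-up expression stand-ins, not the actual Catalyst `Expression`/`ExpressionSet` classes):

```scala
// Simplified stand-in for a Catalyst expression: just a name plus the
// column references it touches (hypothetical, for illustration only).
case class Expr(name: String, references: Set[String])

val nullLiteral = Expr("null", Set.empty)          // Literal(null): no references
val isNotNull   = Expr("isnotnull(_1)", Set("_1")) // references data column _1

val partitionSet = Set("part")                     // made-up partition column
val filterSet    = Set(nullLiteral, isNotNull)

// Before the fix: the empty reference set is a subset of partitionSet,
// so the null literal is claimed by partitionKeyFilters ...
val partitionKeyFilters = filterSet.filter(_.references.subsetOf(partitionSet))

// ... and the set difference then silently drops it from the post-filter.
val afterScanFilters = filterSet -- partitionKeyFilters

// After the fix, as described above: filters with no references stay in
// the Spark-side post-filter as well.
val afterScanFiltersFixed =
  filterSet.filter(f => f.references.isEmpty || !partitionKeyFilters.contains(f))
```

In this sketch the null literal survives in `afterScanFiltersFixed`, which corresponds to `null` reappearing in the `*Filter` above the scan in the fixed physical plan.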
After this PR, the plan becomes:
```
== Parsed Logical Plan ==
'Filter ('_1 = true)
+- Relation[_1#276] parquet
== Analyzed Logical Plan ==
_1: boolean
Filter (cast(_1#276 as double) = cast(true as double))
+- Relation[_1#276] parquet
== Optimized Logical Plan ==
Filter (isnotnull(_1#276) && null)
+- Relation[_1#276] parquet
== Physical Plan ==
*Project [_1#276]
+- *Filter (isnotnull(_1#276) && null)
+- *FileScan parquet [_1#276] Batched: true, Format: ParquetFormat,
Location:
InMemoryFileIndex[file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-a5d59bdb-5b...,
PartitionFilters: [null], PushedFilters: [IsNotNull(_1)], ReadSchema:
struct<_1:boolean>
```
## How was this patch tested?
Unit test in `FileSourceStrategySuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-18753
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16184.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16184
----
commit c6fe34511fc1ea5c36713d435dc64673deceae7f
Author: hyukjinkwon <[email protected]>
Date: 2016-12-07T02:39:26Z
keep pushed-down null literal as a filter in Spark-side post-filter
----