GitHub user dbtsai opened a pull request:
https://github.com/apache/spark/pull/21442
[SPARK-24402] [SQL] Optimize `In` expression when only one element in the
collection or collection is empty
## What changes were proposed in this pull request?
Two new rules in the logical plan optimizers are added.
1. When there is only one element in the **`Collection`**, the
physical plan will be optimized to **`EqualTo`**, so predicate
pushdown can be used.
```scala
profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
"""
|== Physical Plan ==
|*(1) Project [profileID#0]
|+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
| +- *(1) FileScan parquet [profileID#0] Batched: true, Format:
Parquet,
| PartitionFilters: [],
| PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
| ReadSchema: struct<profileID:int>
""".stripMargin
```
2. When the **`Collection`** is empty, and the input is nullable, the
logical plan will be simplified to
```scala
profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
"""
|== Optimized Logical Plan ==
|Filter if (isnull(profileID#0)) null else false
|+- Relation[profileID#0] parquet
""".stripMargin
```
TODO:
1. For multiple conditions with numbers less than certain thresholds,
we should still allow predicate pushdown.
2. Optimize the **`In`** using **`tableswitch`** or **`lookupswitch`**
when the numbers of the categories are low, and they are **`Int`**,
**`Long`**.
3. The default immutable hash trees set is slow for query, and we
should do benchmark for using different set implementation for faster
query.
4. **`filter(if (condition) null else false)`** can be optimized to false.
## How was this patch tested?
Couple new tests are added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dbtsai/spark optimize-in
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21442.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21442
----
commit 7c44c70fe664e73b36a49a974ece93a0c83d7f8e
Author: DB Tsai <d_tsai@...>
Date: 2018-05-28T07:27:09Z
Optimize `In`
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]