GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/14117
[SPARK-16461][SQL] Support partition batch pruning with `<=>` predicate in
InMemoryTableScanExec
## What changes were proposed in this pull request?
It seems the `EqualNullSafe` filter was missed when pruning partition batches in
cached tables.
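To illustrate the idea, here is a minimal, self-contained sketch of how a null-safe equality predicate could be checked against per-batch min/max statistics. It is a simplified model, not the code in `InMemoryTableScanExec`: `BatchStats`, `NullSafePruningSketch`, and `mightMatch` are hypothetical names, and the handling of a null literal via `nullCount` is an assumption made for the sketch.

```scala
// Hypothetical, simplified model of the statistics kept for one cached batch.
final case class BatchStats(lowerBound: Long, upperBound: Long, nullCount: Int)

object NullSafePruningSketch {
  // Returns true when the batch might contain a row satisfying `col <=> literal`.
  // A batch for which this returns false can be skipped without being scanned.
  def mightMatch(stats: BatchStats, literal: Option[Long]): Boolean = literal match {
    // Non-null literal: same range check as for plain equality.
    case Some(v) => stats.lowerBound <= v && v <= stats.upperBound
    // Null literal (assumption of this sketch): only batches holding nulls can match.
    case None => stats.nullCount > 0
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      BatchStats(0L, 99L, 0),
      BatchStats(100L, 199L, 0),
      BatchStats(200L, 299L, 3)
    )
    // Keep only the batches that could satisfy `col <=> 1`.
    val toScan = batches.filter(mightMatch(_, Some(1L)))
    println(s"Batches to scan: ${toScan.size} of ${batches.size}")
  }
}
```

Without such a rule for `<=>`, every batch has to be scanned even when its value range cannot contain the literal.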
Supporting it seems to make such queries roughly 5 times faster (best time 204 ms before vs. 42 ms after).
Running the code below:
```scala
test("Null-safe equal comparison") {
  val N = 20000000
  val df = spark.range(N).repartition(20)
  val benchmark = new Benchmark("Null-safe equal comparison", N)
  df.createOrReplaceTempView("t")
  spark.catalog.cacheTable("t")
  // Run the query once so the cache is materialized before timing.
  sql("select id from t where id <=> 1").collect()

  // Time the same query over 10 iterations.
  benchmark.addCase("Null-safe equal comparison", 10) { _ =>
    sql("select id from t where id <=> 1").collect()
  }
  benchmark.run()
}
```
produces the results below:
**Before:**
```
Running benchmark: Null-safe equal comparison
Running case: Null-safe equal comparison
Stopped after 10 iterations, 2098 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Null-safe equal comparison:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Null-safe equal comparison                204 /  210         98.1          10.2       1.0X
```
**After:**
```
Running benchmark: Null-safe equal comparison
Running case: Null-safe equal comparison
Stopped after 10 iterations, 478 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Null-safe equal comparison:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Null-safe equal comparison                 42 /   48        474.1           2.1       1.0X
```
## How was this patch tested?
Unit tests in `PartitionBatchPruningSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-16461
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14117.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14117
----
commit a7c750ec236973bec29380c39f9b7e1627979d04
Author: hyukjinkwon <[email protected]>
Date: 2016-07-09T13:41:01Z
Support partition batch pruning with `<=>` (EqualNullSafe) predicate in
InMemoryTableScanExec
----