GitHub user KaiXinXiaoLei opened a pull request:
https://github.com/apache/spark/pull/20865
[SPARK-23542] The exists action should be further optimized in logical plan
## What changes were proposed in this pull request?
The optimized logical plan of query `select * from tt1 where exists (select
* from tt2 where tt1.i = tt2.i)` is
> == Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- HiveTableRelation `default`.`tt1`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- HiveTableRelation `default`.`tt2`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]
The `exists` predicate is rewritten as a left semi join. But for the query
`select * from tt1 left semi join tt2 on tt2.i = tt1.i`, the optimized logical
plan is:
> == Optimized Logical Plan ==
Join LeftSemi, (i#22 = i#20)
:- Filter isnotnull(i#20)
: +- HiveTableRelation `default`.`tt1`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
+- Project [i#22]
+- HiveTableRelation `default`.`tt2`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]
So I think the optimized logical plan of `select * from tt1 where exists
(select * from tt2 where tt1.i = tt2.i)` should be further optimized in the
same way, by inferring `isnotnull` filters on the join keys.
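The reason such a filter is safe can be sketched outside Spark. The following is only an illustrative model, not Spark's optimizer code: the helper names (`sql_eq`, `left_semi_join`, `not_null`) and the sample rows are hypothetical, and it assumes the usual SQL rule that an equality comparison involving NULL never satisfies a join condition. Under that assumption, rows with a NULL join key cannot contribute to a left semi join, so dropping them first leaves the result unchanged.

```python
# Illustrative model only; this is not Spark's optimizer code.
# None stands in for SQL NULL.

def sql_eq(a, b):
    """Join-condition equality: a comparison involving NULL never matches."""
    if a is None or b is None:
        return False
    return a == b


def left_semi_join(left, right, key):
    """Keep each left row that has at least one right row with an equal key."""
    return [l for l in left if any(sql_eq(l[key], r[key]) for r in right)]


def not_null(rows, key):
    """Model of `Filter isnotnull(key)`: drop rows whose key is NULL."""
    return [row for row in rows if row[key] is not None]


# Hypothetical sample data for tables tt1 and tt2 (columns i and s).
tt1 = [{"i": 1, "s": "a"}, {"i": None, "s": "b"}, {"i": 3, "s": "c"}]
tt2 = [{"i": 1, "s": "x"}, {"i": None, "s": "y"}, {"i": 4, "s": "z"}]

# Semi join without any null filtering.
plain = left_semi_join(tt1, tt2, "i")

# Semi join after filtering NULL keys on both sides.
filtered = left_semi_join(not_null(tt1, "i"), not_null(tt2, "i"), "i")
```

In this model `plain` and `filtered` contain the same rows, which is why adding the filter changes only how much data reaches the join, not the query result.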
## How was this patch tested?
With this patch, the optimized logical plan of `select * from tt1 where
exists (select * from tt2 where tt1.i = tt2.i)` is:
> == Optimized Logical Plan ==
Join LeftSemi, (i#14 = i#16)
:- Filter isnotnull(i#14)
: +- HiveTableRelation `default`.`tt1`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
+- Project [i#16]
+- Filter isnotnull(i#16)
+- HiveTableRelation `default`.`tt2`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/KaiXinXiaoLei/spark SPARK-23542
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20865.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20865
----
commit 3bf987828acea096811ba8dd1d42de8221cac62d
Author: KaiXinXiaoLei <584620569@...>
Date: 2018-03-02T03:33:26Z
message
----