GitHub user gczsjdy opened a pull request:
https://github.com/apache/spark/pull/19862
Make SortMergeJoin read less data when wholeStageCodegen is off
## What changes were proposed in this pull request?
In SortMergeJoin(with wholeStageCodegen), an optimization already exists:
if the left table of a partition is empty then there is no need to read the
right table of this corresponding partition. This benefits the case in which
many partitions of left table is empty and the right table is big.
While in the code path without wholeStageCodegen, this optimization doesn't
happen. This is mainly due to the lack of optimization in codegen-SortMergeJoin
& the lack of support in `SortExec`, which doesn't do lazy evaluation. UI when
wholeStageCodegen is off:
<img width="908" alt="off_wholestage_before"
src="https://user-images.githubusercontent.com/7685352/33493586-8311ac58-d6fb-11e7-816c-7a0fb2065345.PNG">
When the switch is on:

This PR lazy evaluates sort, and only trigger the right table read after
reading the left table and find it's not empty.
After this PR, with wholeStageCodegen off:

## How was this patch tested?
Existed test suites.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gczsjdy/spark smj_less_read
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19862.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19862
----
commit a8511e8b7e0240c471f88e703545301425960901
Author: GuoChenzhao <[email protected]>
Date: 2017-11-28T09:01:34Z
Init solution
commit a5c14b473636e571c6d4f17798f220be080a27a1
Author: GuoChenzhao <[email protected]>
Date: 2017-11-28T09:31:07Z
Style
commit a26ca57d56b0b2df81daa82da49f4ff564fc10f5
Author: GuoChenzhao <[email protected]>
Date: 2017-11-28T09:37:54Z
Comments
commit 6d875f84de95d67f84ef9774cb6b6ee8273d46a6
Author: GuoChenzhao <[email protected]>
Date: 2017-12-01T11:32:06Z
lazy evaluation in SortExec
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]