GitHub user wzhfy opened a pull request:
https://github.com/apache/spark/pull/17339
[SPARK-20010][SQL] Sort information is lost after sort merge join
## What changes were proposed in this pull request?
Currently, after an inner sort merge join (SMJ), the output keeps only the
ordering of the left join key. However, in the output of an inner join the
right key has the same values and the same order as the left key. So if a
later SMJ joins on the right key, an unnecessary sort is added, which incurs
extra cost.
As a more complicated example, consider A join B on A.key = B.key join C on
B.key = C.key join D on A.key = D.key. An unnecessary sort on B.key is added
when joining {A, B} with C, and another on A.key when joining {A, B, C} with D.
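
The extra sorts in this example can be counted with a small toy model. This is a sketch in Python, not Spark's planner: `count_extra_sorts` and its `keep_equivalent_keys` flag are hypothetical names, and the model only tracks which columns the already-joined (left) side is known to be sorted on. The initial sorts of the base tables are not counted.

```python
from typing import FrozenSet, List, Tuple

# Each join is (left_key, right_key); the left side is the result of the previous joins.
Join = Tuple[str, str]


def count_extra_sorts(joins: List[Join], keep_equivalent_keys: bool) -> int:
    """Count sorts added on the already-joined (left) side of a chain of inner SMJs."""
    ordering: FrozenSet[str] = frozenset()  # columns the current output is known to be sorted on
    extra_sorts = 0
    for index, (left_key, right_key) in enumerate(joins):
        if index > 0 and left_key not in ordering:
            # The model cannot prove the left side is sorted on left_key, so it re-sorts.
            extra_sorts += 1
            ordering = frozenset()
        if keep_equivalent_keys:
            # left_key == right_key on every output row, so both share the same order.
            ordering = ordering | {left_key, right_key}
        else:
            # Behaviour described above: only the left key ordering survives the join.
            ordering = frozenset({left_key})
    return extra_sorts


chain = [("A.key", "B.key"), ("B.key", "C.key"), ("A.key", "D.key")]
print(count_extra_sorts(chain, keep_equivalent_keys=False))  # 2
print(count_extra_sorts(chain, keep_equivalent_keys=True))   # 0
```

With only the left key kept, the model re-sorts once for B.key and once for A.key; when both keys are remembered, neither sort is needed.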
To fix this, we need to propagate all ordering information (the equivalent
expressions that share the same order) bottom-up through `outputOrdering` and
`SortOrder`.
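
One way to picture the idea is an ordering entry that carries the expressions known to share its order. The sketch below is a Python stand-in under stated assumptions, not the code in this patch: the `SortOrder` dataclass here only borrows the name, and the `equivalents` field, `satisfies` method, and `inner_smj_output_ordering` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SortOrder:
    """Toy stand-in for an ordering entry; not Spark's actual class."""
    child: str                                 # expression the data is sorted on
    ascending: bool = True
    equivalents: FrozenSet[str] = frozenset()  # hypothetical: expressions with the same order

    def satisfies(self, required: "SortOrder") -> bool:
        """True if data with this ordering is already sorted as `required` demands."""
        same_direction = self.ascending == required.ascending
        known = {self.child} | self.equivalents
        return same_direction and required.child in known


def inner_smj_output_ordering(left: SortOrder, right_key: str) -> SortOrder:
    """Bottom-up step: the join output inherits the left ordering plus the right key."""
    return SortOrder(left.child, left.ascending, left.equivalents | {right_key})


ab = inner_smj_output_ordering(SortOrder("A.key"), "B.key")
abc = inner_smj_output_ordering(ab, "C.key")
print(abc.satisfies(SortOrder("B.key")))                   # True
print(abc.satisfies(SortOrder("B.key", ascending=False)))  # False
```

Because each join only extends the set it received from its child, the information flows upward through the plan without any node needing to look below its direct inputs.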
## How was this patch tested?
Test cases are added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wzhfy/spark sortEnhance
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17339.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17339
----
commit f3b5b348e4c28ab795755bb099210af872a700df
Author: wangzhenhua <[email protected]>
Date: 2017-03-17T07:23:32Z
derive sorted expressions for smj
----