GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/11121
[SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN
This patch adds a new optimizer rule for performing limit pushdown. Limits
will now be pushed down in two cases:
- If a limit is on top of a `UNION ALL` operator, then a partition-local
limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN`, then a partition-local limit will
be pushed to the outer side of the join (for full outer joins, the limit will
be pushed to both sides).
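As a rough illustration of the first case, here is a toy Python model of the rewrite (these are not Spark's actual classes; all names are illustrative): the global limit stays on top, and a partition-local limit is inserted above each child of the union.

```python
# Toy model of limit pushdown through UNION ALL (illustrative names,
# not Spark's actual operator classes).
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Union:
    children: Tuple  # tuple of child plans


@dataclass(frozen=True)
class LocalLimit:
    n: int
    child: object


@dataclass(frozen=True)
class GlobalLimit:
    n: int
    child: object


def push_through_union(plan):
    """If a global limit sits on a UNION ALL, push a LocalLimit to each child.

    The original GlobalLimit is kept on top: each child may still produce
    up to n rows, so the global cap is still needed after the union.
    """
    if isinstance(plan, GlobalLimit) and isinstance(plan.child, Union):
        limited = tuple(LocalLimit(plan.n, c) for c in plan.child.children)
        return GlobalLimit(plan.n, Union(limited))
    return plan


before = GlobalLimit(10, Union((Relation("a"), Relation("b"))))
after = push_through_union(before)
```

Because only `LocalLimit` operators are pushed down, no extra shuffle is introduced in either branch.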
These optimizations were proposed previously by @gatorsmile in #10451 and
#10454, but those earlier PRs were closed and deferred because, at that time,
Spark's physical `Limit` operator triggered a full shuffle to perform global
limits, so pushdown risked actually harming performance by introducing
additional shuffles/stages. In #7334, we split the `Limit` operator into
separate `LocalLimit` and `GlobalLimit` operators, so we can now push down
only local limits (which don't require extra shuffles). This patch is based
on both of @gatorsmile's patches, with changes and simplifications that take
advantage of partition-local limiting.
When we push down the limit, we still keep the original limit in place, so
we need a mechanism to ensure that the optimizer rule doesn't keep
pattern-matching once the limit has been pushed down. To handle this, this
patch adds a `maxRows` method to `SparkPlan` which returns the maximum
number of rows that the plan can compute, and defines the pushdown rules to
push a limit to a child only if that child's maxRows exceeds the limit's
maxRows. This idea is carried over from #10451; see that patch for
additional discussion.
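The maxRows guard can be sketched as follows (a toy Python model with illustrative names, not Spark's actual API): a plan advertises an upper bound on the rows it can produce, and the rule pushes a limit into a child only when that child can still exceed the limit, so applying the rule a second time is a no-op.

```python
# Toy sketch of the maxRows guard (illustrative names, not Spark's API).
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    name: str

    def max_rows(self) -> Optional[int]:
        return None  # row count unknown / unbounded


@dataclass(frozen=True)
class LocalLimit:
    n: int
    child: object

    def max_rows(self) -> Optional[int]:
        # A local limit bounds its output by n, or tighter if the
        # child is already more constrained.
        child_max = self.child.max_rows()
        return self.n if child_max is None else min(self.n, child_max)


def maybe_push(n, child):
    """Push a LocalLimit only if the child can still exceed n rows."""
    child_max = child.max_rows()
    if child_max is not None and child_max <= n:
        return child  # already bounded by the limit: rule is a no-op
    return LocalLimit(n, child)


once = maybe_push(10, Relation("t"))  # wraps the relation in LocalLimit(10, ...)
twice = maybe_push(10, once)          # fixed point: no further rewriting
```

Since `once.max_rows()` is already 10, the second application returns the plan unchanged, which is exactly the fixed-point behavior the optimizer rule needs.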
/cc @marmbrus @rxin @gatorsmile for review (I plan to update the test cases
and add a few more cases to ensure that pushdown does _not_ take place in
certain circumstances, but this patch is about 90% ready to go and the core
changes should be ready for review).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark limit-pushdown-2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11121.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11121
----
commit 9df9ffddc56768f4737e9f4e8587ad6e3629d945
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T01:43:19Z
Split logical limit operator.
commit 449e96e4a04c217b19e1150525a67f38abd30cf1
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T01:45:08Z
Also split physical planning.
commit 060b9b870e6a16a457f4712f8548e597082565a1
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T02:44:48Z
Import gatorsmile's pushdown through union and clean up to work with new
split Limit operator.
commit 00e7f393c33090dc535f1b2892f2184846264c75
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T03:06:20Z
Fix join pushdown.
commit 7b86111bb76ba7fd2e1a9d7bb77b345e9a82bad2
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T03:31:34Z
Define maxRows in more operators.
----