GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/11121
[SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN
This patch adds a new optimizer rule for performing limit pushdown. Limits
will now be pushed down in two cases:
- If a limit is on top of a `UNION ALL` operator, then a partition-local
limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN`, then a partition-local limit will
be pushed to the outer side of the join (for full outer joins, the limit will
be pushed to both sides).
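As a rough illustration of the first case, here is a toy Python model of the rewrite (these are not Spark's actual classes; all names are illustrative): the global limit stays on top, and a partition-local limit is inserted above each child of the union.

```python
# Toy model of limit pushdown through UNION ALL (illustrative names,
# not Spark's actual operator classes).
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    name: str


@dataclass(frozen=True)
class Union:
    children: Tuple  # tuple of child plans


@dataclass(frozen=True)
class LocalLimit:
    n: int
    child: object


@dataclass(frozen=True)
class GlobalLimit:
    n: int
    child: object


def push_through_union(plan):
    """If a global limit sits on a UNION ALL, push a LocalLimit to each child.

    The original GlobalLimit is kept on top: each child may still produce
    up to n rows, so the global cap is still needed after the union.
    """
    if isinstance(plan, GlobalLimit) and isinstance(plan.child, Union):
        limited = tuple(LocalLimit(plan.n, c) for c in plan.child.children)
        return GlobalLimit(plan.n, Union(limited))
    return plan


before = GlobalLimit(10, Union((Relation("a"), Relation("b"))))
after = push_through_union(before)
```

Because only `LocalLimit` operators are pushed down, no extra shuffle is introduced in either branch.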
These optimizations were proposed previously by @gatorsmile in #10451 and
#10454, but those earlier PRs were closed and deferred because, at that time,
Spark's physical `Limit` operator triggered a full shuffle to perform global
limits, so pushdown risked actually harming performance by introducing
additional shuffles/stages. In #7334, we split the `Limit` operator into
separate `LocalLimit` and `GlobalLimit` operators, so we can now push down
only local limits (which don't require extra shuffles). This patch is based
on both of @gatorsmile's patches, with changes and simplifications that take
advantage of partition-local limiting.
When we push down the limit, we still keep the original limit in place, so
we need a mechanism to ensure that the optimizer rule doesn't keep
pattern-matching once the limit has been pushed down. To handle this, this
patch adds a `maxRows` method to `SparkPlan` which returns the maximum
number of rows that the plan can compute, and defines the pushdown rules to
push a limit to a child only if that child's maxRows exceeds the limit's
maxRows. This idea is carried over from #10451; see that patch for
additional discussion.
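The maxRows guard can be sketched as follows (a toy Python model with illustrative names, not Spark's actual API): a plan advertises an upper bound on the rows it can produce, and the rule pushes a limit into a child only when that child can still exceed the limit, so applying the rule a second time is a no-op.

```python
# Toy sketch of the maxRows guard (illustrative names, not Spark's API).
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    name: str

    def max_rows(self) -> Optional[int]:
        return None  # row count unknown / unbounded


@dataclass(frozen=True)
class LocalLimit:
    n: int
    child: object

    def max_rows(self) -> Optional[int]:
        # A local limit bounds its output by n, or tighter if the
        # child is already more constrained.
        child_max = self.child.max_rows()
        return self.n if child_max is None else min(self.n, child_max)


def maybe_push(n, child):
    """Push a LocalLimit only if the child can still exceed n rows."""
    child_max = child.max_rows()
    if child_max is not None and child_max <= n:
        return child  # already bounded by the limit: rule is a no-op
    return LocalLimit(n, child)


once = maybe_push(10, Relation("t"))  # wraps the relation in LocalLimit(10, ...)
twice = maybe_push(10, once)          # fixed point: no further rewriting
```

Since `once.max_rows()` is already 10, the second application returns the plan unchanged, which is exactly the fixed-point behavior the optimizer rule needs.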
/cc @marmbrus @rxin @gatorsmile for review (I plan to update the test cases
and add a few more cases to ensure that pushdown does _not_ take place in
certain circumstances, but this patch is about 90% ready to go and the core
changes should be ready for review).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark limit-pushdown-2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11121.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11121
----
commit 9df9ffddc56768f4737e9f4e8587ad6e3629d945
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T01:43:19Z
Split logical limit operator.
commit 449e96e4a04c217b19e1150525a67f38abd30cf1
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T01:45:08Z
Also split physical planning.
commit 060b9b870e6a16a457f4712f8548e597082565a1
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T02:44:48Z
Import gatorsmile's pushdown through union and clean up to work with new
split Limit operator.
commit 00e7f393c33090dc535f1b2892f2184846264c75
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T03:06:20Z
Fix join pushdown.
commit 7b86111bb76ba7fd2e1a9d7bb77b345e9a82bad2
Author: Josh Rosen <[email protected]>
Date: 2016-02-04T03:31:34Z
Define maxRows in more operators.
----