Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/7334#discussion_r51971305
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -337,8 +337,12 @@ private[sql] abstract class SparkStrategies extends
QueryPlanner[SparkPlan] {
execution.Sample(lb, ub, withReplacement, seed, planLater(child))
:: Nil
case logical.LocalRelation(output, data) =>
LocalTableScan(output, data) :: Nil
+ case logical.ReturnAnswer(logical.Limit(IntegerLiteral(limit),
child)) =>
+ execution.CollectLimit(limit, planLater(child)) :: Nil
case logical.Limit(IntegerLiteral(limit), child) =>
- execution.Limit(limit, planLater(child)) :: Nil
+ val perPartitionLimit = execution.LocalLimit(limit,
planLater(child))
--- End diff --
@gatorsmile: I think we're in agreement. To recap:
Before this patch, ` select * from (select * from A limit 10 ) t1 UNION
ALL (select * from B limit 10) t2` should get planned as:
```
âââââââââââ
â Union â
âââââââââââ
â²
ââââââââââ´âââââââââ
â â
ââââââââââââ
ââââââââââââ
â Limit 10 â â Limit 10 â
ââââââââââââ
ââââââââââââ
â² â²
â â
ââââââââââââ
ââââââââââââ
â Scan A â â Scan B â
ââââââââââââ
ââââââââââââ
```
Afterwards, this becomes
```
âââââââââââ
â Union â
âââââââââââ
â²
ââââââââââ´âââââââââââ
â â
âââââââââââââââââ
ââââââââââââââââ
âGlobalLimit 10 â âGlobalLimit 10â
âââââââââââââââââ
ââââââââââââââââ
â² â²
â â
ââââââââââââââââ
ââââââââââââââââ
âLocalLimit 10 â â LocaLimit 10 â
ââââââââââââââââ
ââââââââââââââââ
â² â²
â â
ââââââââââââ
ââââââââââââ
â Scan A â â Scan B â
ââââââââââââ
ââââââââââââ
```
What is **not** legal to do here is to pull the `GlobalLimit` up, so the
following would be wrong:
```
âââââââââââââââââ
âGlobalLimit 10 â
âââââââââââââââââ
â²
â
âââââââââââ
â Union â
âââââââââââ
â²
ââââââââââ´âââââââââââ
â â
ââââââââââââââââ
ââââââââââââââââ
âLocalLimit 10 â â LocaLimit 10 â
ââââââââââââââââ
ââââââââââââââââ
â² â²
â â
ââââââââââââ
ââââââââââââ
â Scan A â â Scan B â
ââââââââââââ
ââââââââââââ
```
That plan would be semantically equivalent to executing
```
select * from (select * from A ) t1 UNION ALL (select * from B) t2 LIMIT
10
```
@davies, were you suggesting that the current planning is wrong? Or that we
need more tests to guard against incorrect changes to limit planning? I don't
believe that the changes in this patch will affect the planning of the case
being described here, since we're not making any changes to limit pull-up or
push-down. I _do_ have a followup patch in the works which takes @gatorsmile's
two limit-pushdown patches and rebases them on top of the changes here:
https://github.com/apache/spark/compare/master...JoshRosen:limit-pushdown-2. In
that patch, I do plan to add more tests to handle these pushdown-related
concerns.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]