[ https://issues.apache.org/jira/browse/HIVE-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141631#comment-17141631 ]
Jesus Camacho Rodriguez edited comment on HIVE-23723 at 6/22/20, 1:49 AM:
--------------------------------------------------------------------------
Iirc it was disabled by default because the rule pushes an exact limit, which
means that it may introduce reducers throughout the plan, which
could result in additional stages (as you see in the plan above). Thus, it was
only triggered via a cost-based decision, because if we were not filtering much
data, it could result in regressions. Until we could explore this further and
tune the cost model, we decided to leave it disabled by default. Fwiw, note that
the rule can also push the limit through other operators, e.g., union.
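
To illustrate the union case in isolation, here is a minimal sketch in plain Python. It assumes a UNION ALL over in-memory lists; {{limit}} and {{union_all}} are hypothetical helpers written for this example, not Hive or Calcite APIs. It only shows why copying the limit below the union is safe as long as the limit on top is kept.

```python
def limit(rows, n):
    """Keep at most n rows (no ordering implied)."""
    return rows[:n]


def union_all(a, b):
    """UNION ALL: concatenate both inputs, keeping duplicates."""
    return a + b


a = list(range(100))
b = list(range(100, 200))
n = 5

# Original plan: the limit sits only on top of the union.
original = limit(union_all(a, b), n)

# Pushed plan: a copy of the limit on each branch, and the top limit is kept.
branches = union_all(limit(a, n), limit(b, n))
pushed = limit(branches, n)

print(len(union_all(a, b)), len(branches))  # rows flowing into the top limit
print(len(original), len(pushed))
```

Each branch now hands at most n rows to the union, but the union itself can still emit up to 2n rows, which is why the top limit stays.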
It would be great if we could enable the rule, identify the additionally
created {{limit}} operators with a {{topn}} label, and pass the top-n
information via a hint to the Hive physical plan generation logic; this would
also open a path to creating {{topNKey}}
operators from the SQL statement, as [~gopalv] suggested at some point.
However, I understand this may be out of the scope of this patch.
Concerning your patch, it seems you are removing the original limit on top of
the left outer join? Note that you cannot remove it: if you have 5 input rows
on the left side, you know the LOJ will produce at least 5 rows, but you
cannot guarantee the join will produce at most 5 rows. The {{Fetch Operator}}
with limit guarantees you get at most 5 rows, but since the rule matches
on a {{Limit}} operator, that operator could be anywhere in the plan, e.g., if CBO
pushes limit operators through other operators.
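
The row-count argument can be checked with a small sketch in plain Python. It assumes a naive nested-loop left outer join over lists of dicts; {{left_outer_join}}, {{limit}} and the sample data are made up for illustration and are not Hive code.

```python
def limit(rows, n):
    """Keep at most n rows (no ordering implied)."""
    return rows[:n]


def left_outer_join(left, right, key):
    """Naive LOJ: every left row appears at least once; unmatched rows pair with None."""
    out = []
    for l_row in left:
        matches = [r_row for r_row in right if r_row[key] == l_row[key]]
        if matches:
            out.extend((l_row, r_row) for r_row in matches)
        else:
            out.append((l_row, None))
    return out


left = [{"id": i} for i in range(10)]
# id 0 matches twice on the right side, so the join fans out.
right = [{"id": 0, "v": "a"}, {"id": 0, "v": "b"}, {"id": 1, "v": "c"}]

# Limit pushed below the join, with the original limit removed.
pushed_only = left_outer_join(limit(left, 5), right, "id")
print(len(pushed_only))  # 6: more than the 5 rows requested

# Limit pushed below the join, with the original limit kept on top.
pushed_and_kept = limit(pushed_only, 5)
print(len(pushed_and_kept))  # 5
```

With 5 left rows the join emits 6 rows here because one key matches twice, so only the retained limit on top bounds the result at 5.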
> Limit operator pushdown through LOJ
> -----------------------------------
>
> Key: HIVE-23723
> URL: https://issues.apache.org/jira/browse/HIVE-23723
> Project: Hive
> Issue Type: Improvement
> Components: Hive
> Reporter: Attila Magyar
> Assignee: Attila Magyar
> Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23723.1.patch
>
>
> Limit operator (without an order by) can be pushed through SELECTS and LEFT
> OUTER JOINs.