[
https://issues.apache.org/jira/browse/HIVE-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782893#comment-15782893
]
Jesus Camacho Rodriguez commented on HIVE-15474:
------------------------------------------------
[~lirui], sorry for taking some time replying, I was away for a few days.
I have updated the description of this issue to try to explain better what I am
trying to do. I think initial description was too brief and vague.
Given the physical plan you provide, we will propagate the limit from RS4 into
RS2 (observe the explain plan above). RS2 produces the _top N_ keys for each
partition; thus, GBY3 operator produces only results for those keys. Observe in
the patch that there is no change for the GBY operator.
Concerning Spark. My current understanding is that the chain of operators is
the same. But I was thinking further about it, and this optimization should not
pose any problem in that context, since GBY logic has not changed. If Spark
chooses to ignore RS2 since it is not sorting the input for GBY3, that should
be fine: the limit is in the RS2 operator, not in GBY3. Spark will not benefit
from the optimization, but it still remains correct.
{quote}
Actually I'm also wondering: if we use parallel order by (to use a range
partitioner rather than a hash partitioner in RS2), we can do the groupBy and
orderBy in a single stage, which may improve performance in some cases.
{quote}
It might be beneficial in some cases indeed. However, it is a complex
cost-based decision which would need multiple extensions, as I can think on
multiple factors that would influence it, e.g., data skew, number of records
for the top N groups, the limit of records itself, etc.
> Extend limit propagation for chain of RS-GB-RS operators
> --------------------------------------------------------
>
> Key: HIVE-15474
> URL: https://issues.apache.org/jira/browse/HIVE-15474
> Project: Hive
> Issue Type: Bug
> Components: Physical Optimizer
> Affects Versions: 2.2.0
> Reporter: Jesus Camacho Rodriguez
> Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-15474.patch
>
>
> The goal is to extend the work started in HIVE-14002.
> For instance, given the following query:
> {code:sql}
> explain
> select key, value, count(key + 1) as agg1 from src
> group by key, value
> order by key, value, agg1 limit 20;
> {code}
> We generate the following physical plan:
> {{TS1 - GBY2 - RS3 - GBY4 - RS5 - SEL6 - LIM7 - FS8}}
> We can push the limit to RS3 operator, as we will generate records for the
> _top N_ keys, and thus, GBY4 will produce the _top N_ results. However,
> currently we do not do it.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)