[ https://issues.apache.org/jira/browse/HIVE-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766929#comment-15766929 ]
Rui Li commented on HIVE-15474:
-------------------------------
Hi [~jcamachorodriguez], I think this is an interesting optimization; please
help me understand it better. For a groupBy + orderBy query, we'll have roughly
the following operator chain (assuming map-side aggregation is on, i.e. GBY1
and RS2):
{{GBY1 -- RS2 -- GBY3 -- FS4 -- TS5 -- RS6 -- SEL7 -- FS8}}
Is the proposal to push the limit into GBY3, so that there is less data for
the sorting stage? If so, I think that requires the output of GBY3 to already
be sorted, right? For MR (and probably Tez too), the shuffled data is sorted
by key within each partition, which means the input to GBY3 is sorted. So
you're implying that, given a sorted input, the GBY operator will preserve
that order in its output?
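To make the question concrete, here is a minimal Python sketch (not Hive code; the function names and the sample data are made up for illustration). It assumes a streaming group-by that aggregates runs of adjacent equal keys, which is how a sort-based reducer-side GBY would behave on key-sorted shuffle output. Under that assumption the groups come out in key order, so a limit placed right after the group-by can stop early:

```python
from itertools import groupby, islice

def streaming_group_count(rows):
    """Hypothetical stand-in for GBY3: emit (key, count) for each run of
    adjacent equal keys. Only correct as a group-by if the input is sorted."""
    for key, run in groupby(rows):
        yield key, sum(1 for _ in run)

def group_then_limit(rows, limit):
    """Apply the limit directly on the group-by output (the pushed-down limit)."""
    return list(islice(streaming_group_count(rows), limit))

# Input as it would arrive after a key-sorted shuffle (MR-style).
sorted_input = ["a", "a", "b", "c", "c", "c", "d"]
top2 = group_then_limit(sorted_input, 2)
print(top2)
```

Because the output order follows the input order, the first two groups are also the first two groups of the final ORDER BY on the grouping keys.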
If my understanding is correct so far, this will be a problem for Hive on
Spark, because in Spark the groupBy shuffle doesn't guarantee an order, so
the input to GBY3 may not be sorted. This makes sense, since groupBy semantics
don't require a specific ordering. But maybe we can make some changes to adapt
to this optimization. Actually, I'm also wondering: if we use parallel order by
(that is, a range partitioner rather than a hash partitioner in RS2), we can do
the groupBy and orderBy in a single stage, which may improve performance in
some cases.
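A rough Python illustration of both points (again not Hive or Spark code; the helper names, the sample keys and the single range boundary are assumptions chosen for the example). The first part shows that limiting the output of a hash-style group-by over unsorted input does not give the same rows as sorting first. The second part sketches the single-stage idea: with range partitioning, each partition can be grouped and sorted locally, and concatenating the partitions in range order gives a globally ordered result.

```python
from bisect import bisect_right
from collections import Counter

def hash_group_count(rows):
    """Hash-style group-by: output follows first-arrival order, not key order."""
    counts = Counter()
    for key in rows:
        counts[key] += 1
    return list(counts.items())

def range_partition(rows, boundaries):
    """Send each key to the partition whose key range contains it.
    `boundaries` is a sorted list of hypothetical split points."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for key in rows:
        parts[bisect_right(boundaries, key)].append(key)
    return parts

def single_stage_group_order(rows, boundaries):
    """Group and sort inside each range partition, then concatenate in range order."""
    result = []
    for part in range_partition(rows, boundaries):
        result.extend(sorted(hash_group_count(part)))
    return result

# Input order as it might arrive from a shuffle with no ordering guarantee.
unsorted_input = ["d", "a", "c", "a", "b", "c", "c"]

grouped = hash_group_count(unsorted_input)
early_limit = grouped[:2]          # limit applied right after the group-by
correct = sorted(grouped)[:2]      # limit applied after the full sort
single_stage = single_stage_group_order(unsorted_input, ["b"])
print(early_limit, correct, single_stage)
```

In this toy run the early limit keeps the groups for "d" and "a", while the correct answer is "a" and "b", which is why the optimization needs either a sorted input to GBY3 or a limit that is aware of the sort order.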
> Extend limit propagation for chain of RS-GB-RS operators
> --------------------------------------------------------
>
> Key: HIVE-15474
> URL: https://issues.apache.org/jira/browse/HIVE-15474
> Project: Hive
> Issue Type: Bug
> Components: Physical Optimizer
> Affects Versions: 2.2.0
> Reporter: Jesus Camacho Rodriguez
> Assignee: Jesus Camacho Rodriguez
> Attachments: HIVE-15474.patch
>
>
> The goal is to extend the work started in HIVE-14002.
> For instance, given the following query:
> {code:sql}
> explain
> select key, value, count(key + 1) as agg1 from src
> group by key, value
> order by key, value, agg1 limit 20;
> {code}
> We can push the limit to the GBy operator. However, currently we do not do it.