[
https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747626#comment-13747626
]
Phabricator commented on HIVE-3562:
-----------------------------------
ashutoshc has requested changes to the revision "HIVE-3562 [jira] Some limit
can be pushed down to map stage".
INLINE COMMENTS
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 I think we
don't need a boolean flag. If the user has set hive.limit.pushdown.memory.usage=0,
it is pretty clear they want the optimization disabled.
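A minimal sketch of that single-knob scheme (the class and method names are illustrative, not the actual HiveConf wiring):

```java
// Illustrative sketch: one float config both enables and sizes the
// optimization; 0 (or negative) means disabled, so no separate boolean
// flag is needed. The key string matches the config name under review.
public class PushdownConfig {
    static final String KEY = "hive.limit.pushdown.memory.usage";

    // Fraction of heap the in-memory top-N structure may use.
    static boolean pushdownEnabled(float memoryUsage) {
        return memoryUsage > 0f;
    }

    public static void main(String[] args) {
        System.out.println(pushdownEnabled(0f));    // user disabled it
        System.out.println(pushdownEnabled(0.3f));  // up to 30% of heap
    }
}
```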
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 I didn't
fully follow the example here. What are the keys/values/rows here?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154 select
key from src limit 0; implies the user doesn't want any data, so the RS doesn't
need to emit anything; the optimization is still valid, and super useful, here.
Isn't it? So this check should be limit < 0, not limit <= 0.
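To make the suggested condition concrete, a standalone sketch (topNApplies is a hypothetical helper, not the actual operator code):

```java
// Sketch of the suggested guard: a negative limit means the query has no
// LIMIT clause, so top-N pushdown is off; limit == 0 remains a valid case
// where the RS can simply emit nothing.
public class LimitGuard {
    static boolean topNApplies(int limit) {
        return limit >= 0;  // i.e. disable only when limit < 0, not <= 0
    }

    public static void main(String[] args) {
        System.out.println(topNApplies(-1));  // no LIMIT clause -> false
        System.out.println(topNApplies(0));   // "limit 0" -> true
        System.out.println(topNApplies(10));  // true
    }
}
```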
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 Can
you add a comment explaining why you subtract limit * 64 in this calculation
of the threshold?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:171 I
think we need not special-case limit = 0 by null'ing the collector. As I said
above, it is better to make the RSHash null in that case. This avoids a null
check on the collector in processOp().
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 As I
noted above, let's not have this null check. Apart from your null'ing the
collector above, could there ever be a case where it is null? I don't think
so, so let's remove this if-branch from the inner loop.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340 Should
this instead be collect(keyWritable, value)?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 I
didn't follow your comment about the retry here. Can you expand the comment?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346
collect() instead of out.collect() ?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436
transient?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471 It
would be good to add a LOG.debug() here saying that the ReduceSinkHash is
being disabled.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 I
didn't get your TODO here. Can you explain?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 Better
name: remove ?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498 Add a
comment here saying that values[index] could be null if flush() was called
after this value was added but before remove() was called.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 You
are using PriorityQueue almost like a sorted set. I think java.util.TreeSet
will suffice for what you need here.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 I don't
see you making use of any feature of the Map interface. Looks like TreeSet
would be sufficient here as well.
If we can use TreeSet in both places, then we don't need these two different
implementations of ReduceSinkHash. That would simplify the code quite a bit.
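A standalone sketch of the TreeSet suggestion (TopNSet is a hypothetical class, and it deduplicates keys, which is a simplification relative to the real ReduceSinkHash): keep at most limit smallest keys, evicting the current largest when a smaller one arrives.

```java
import java.util.TreeSet;

// Hypothetical top-N buffer backed by a single TreeSet, sketching how one
// structure could replace both ReduceSinkHash implementations.
public class TopNSet {
    private final int limit;
    private final TreeSet<String> keys = new TreeSet<>();

    TopNSet(int limit) { this.limit = limit; }

    /** Returns true if the key is retained among the smallest `limit` keys. */
    boolean offer(String key) {
        if (keys.contains(key)) {
            return false;                     // already retained (set dedups)
        }
        if (keys.size() < limit) {
            return keys.add(key);             // still room
        }
        if (key.compareTo(keys.last()) >= 0) {
            return false;                     // outside the current top N
        }
        keys.pollLast();                      // evict current largest
        return keys.add(key);
    }

    TreeSet<String> snapshot() { return keys; }

    public static void main(String[] args) {
        TopNSet top = new TopNSet(3);
        for (String k : new String[] {"d", "b", "e", "a", "c"}) {
            top.offer(k);
        }
        System.out.println(top.snapshot());   // [a, b, c]
    }
}
```

The single eviction path is what makes one implementation enough: rows past the cutoff are rejected in O(log n) without ever being buffered.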
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264
This handling of the distinct-keys case can also be refactored to use
ReduceSinkHash. It would be a good idea to take this up in a follow-up jira.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 Also
need to mark it transient.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:91
If we are able to use one data structure in ReduceSinkHash (TreeSet or some
other) for both cases, we won't need this boolean.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46
How are you enforcing the condition that this optimization is applied only to
the last RS? Can you add comments about this?
ql/src/test/queries/clientpositive/limit_pushdown.q:14 Can you also add tests
for the following queries with this optimization on:
explain select distinct(key) from src limit 20;
explain select count(distinct(key)) from src group by key limit 20;
select distinct(key) from src limit 20;
select count(distinct(key)) from src group by key limit 20;
ql/src/test/queries/clientpositive/limit_pushdown.q:43 It would be good to
move the queries where the optimization will not be fired into a separate .q file.
REVISION DETAIL
https://reviews.facebook.net/D5967
BRANCH
HIVE-3562
ARCANIST PROJECT
hive
To: JIRA, tarball, ashutoshc, navis
Cc: njain
> Some limit can be pushed down to map stage
> ------------------------------------------
>
> Key: HIVE-3562
> URL: https://issues.apache.org/jira/browse/HIVE-3562
> Project: Hive
> Issue Type: Bug
> Reporter: Navis
> Assignee: Navis
> Priority: Trivial
> Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch,
> HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch,
> HIVE-3562.D5967.6.patch
>
>
> Queries with a limit clause (with a reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> produce the operator tree
> TS-SEL-RS-EXT-LIMIT-FS
> but the LIMIT can be partially computed in the RS, reducing the size of the shuffle:
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira