[ 
https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747626#comment-13747626
 ] 

Phabricator commented on HIVE-3562:
-----------------------------------

ashutoshc has requested changes to the revision "HIVE-3562 [jira] Some limit 
can be pushed down to map stage".

INLINE COMMENTS
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 I think we 
don't need a boolean flag. If the user has set hive.limit.pushdown.memory.usage=0, 
it is pretty clear they want the optimization disabled.
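  The idea could be sketched as follows; the class and method names here are 
illustrative assumptions, not the actual patch:

```java
// Illustrative sketch (not the actual patch): a single float setting,
// hive.limit.pushdown.memory.usage, can both size and disable the
// optimization, so no separate boolean flag is needed.
public class LimitPushdownConfig {

    /** Returns the memory threshold in bytes, or -1 if disabled. */
    public static long thresholdBytes(float memoryUsage, long maxHeapBytes) {
        if (memoryUsage <= 0f) {
            return -1; // a usage of 0 clearly means "disable the optimization"
        }
        return (long) (maxHeapBytes * memoryUsage);
    }

    public static void main(String[] args) {
        System.out.println(thresholdBytes(0f, 1 << 30));   // disabled: -1
        System.out.println(thresholdBytes(0.1f, 1 << 30)); // ~10% of heap
    }
}
```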
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 I didn't 
fully follow the example here. What are the keys, values, and rows here?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:154 select 
key from src limit 0; implies the user doesn't want any data, so the RS doesn't 
need to emit anything; this optimization is still valid and super useful here. 
Isn't it? So, this check should be limit < 0 instead of limit <= 0.
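  A minimal sketch of that point (names are assumptions, not the actual patch): 
limit == 0 is still a valid, maximally effective pushdown, so only a negative 
limit, meaning "no limit set", should skip the optimization.

```java
// Illustrative sketch: limit == 0 means "emit nothing", which the
// optimization handles perfectly; only limit < 0 disables it.
public class LimitCheck {

    public static boolean pushdownEnabled(int limit) {
        return limit >= 0; // not "limit > 0": limit 0 still benefits
    }

    public static void main(String[] args) {
        System.out.println(pushdownEnabled(0));  // true: emit no rows at all
        System.out.println(pushdownEnabled(-1)); // false: no limit specified
    }
}
```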
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 Can 
you add a comment explaining why you subtract limit * 64 in this calculation of 
the threshold?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:171 I 
don't think we need to special-case limit = 0 by nulling the collector. As I 
said above, it is better to make the RSHash null in that case. This avoids a 
null check on the collector in processOp().
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 As I 
noted above, let's not have this null check. Apart from your nulling the 
collector above, could there ever be a case where it is null? I don't think 
so; in that case let's remove this if-branch from the inner loop.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:340 Should 
this instead be collect(keyWritable, value)?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 I didn't 
understand your comment about retrying here. Can you expand the comment?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:346 
Should this be collect() instead of out.collect()?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436 
Should these be transient?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:471 It 
would be good to add a LOG.debug() here saying that ReduceSinkHash is being 
disabled.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 I 
didn't get your TODO here. Can you explain?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 Would 
remove be a better name?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:498 Add a 
comment here noting that values[index] could be null if flush() was called after 
the value was added but before remove() was called.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 You 
are using PriorityQueue almost like a sorted set. I think java.util.TreeSet 
will suffice for what you need here.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 I don't 
see you using any feature specific to the Map interface. It looks like 
TreeSet would be sufficient here as well.

  If we can use TreeSet in both places, then we don't need these two 
different implementations of ReduceSinkHash. That would simplify the code quite a bit.
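  The bounded-TreeSet idea the review suggests could look roughly like this 
sketch (illustrative only, not the actual ReduceSinkHash; note that TreeSet 
discards duplicate keys, so keeping duplicates would require a multiset-style 
wrapper):

```java
import java.util.TreeSet;

// Illustrative sketch: a TreeSet bounded to n elements serving as a
// top-n structure for the smallest keys.
public class TopN<T extends Comparable<T>> {

    private final int n;
    private final TreeSet<T> set = new TreeSet<>();

    public TopN(int n) {
        this.n = n;
    }

    /** Offers a key; returns false if it falls outside the current top-n. */
    public boolean offer(T key) {
        if (set.size() < n) {
            return set.add(key);
        }
        if (key.compareTo(set.last()) >= 0) {
            return false; // not smaller than the current maximum: drop it
        }
        set.add(key);
        set.pollLast(); // evict the largest to stay at n elements
        return true;
    }

    public int size() {
        return set.size();
    }

    public T largest() {
        return set.last();
    }
}
```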
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:253-264 
This handling of the distinct-keys case could also be refactored to use 
ReduceSinkHash. It would be a good idea to take this up in a follow-up jira.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 This 
also needs to be marked transient.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:91 
If we can use one data structure in ReduceSinkHash (either TreeSet or 
some other) for both cases, we won't need this boolean.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46 
How are you enforcing the condition that this optimization applies only to the last 
RS? Can you add comments about this?
  ql/src/test/queries/clientpositive/limit_pushdown.q:14 Can you also add 
tests for the following queries with this optimization on:
  explain select distinct(key) from src limit 20;
  explain select count(distinct(key)) from src group by key limit 20;

  select distinct(key) from src limit 20;
  select count(distinct(key)) from src group by key limit 20;
  ql/src/test/queries/clientpositive/limit_pushdown.q:43 It would be good to 
separate the queries where the optimization will not fire into a separate .q file.

REVISION DETAIL
  https://reviews.facebook.net/D5967

BRANCH
  HIVE-3562

ARCANIST PROJECT
  hive

To: JIRA, tarball, ashutoshc, navis
Cc: njain

                
> Some limit can be pushed down to map stage
> ------------------------------------------
>
>                 Key: HIVE-3562
>                 URL: https://issues.apache.org/jira/browse/HIVE-3562
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Trivial
>         Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, 
> HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, 
> HIVE-3562.D5967.6.patch
>
>
> Queries with a limit clause (with a reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> produce the operator tree
> TS-SEL-RS-EXT-LIMIT-FS
> But the LIMIT can be partially computed in the RS, reducing the size of the shuffle:
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
