[ 
https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749306#comment-13749306
 ] 

Phabricator commented on HIVE-3562:
-----------------------------------

ashutoshc has commented on the revision "HIVE-3562 [jira] Some limit can be 
pushed down to map stage".

  Good stuff!

INLINE COMMENTS
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 Yeah I know 
historically we have done that way. But in my experience having more configs 
always confuses users instead of helping.
  ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 Looks 
good. Can you update the patch with this?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 Sounds 
good. Please add it in comment.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 Sounds 
good.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 I see. 
I think just forwarding in this case is simple and better thing to do. So, lets 
leave it that way. Can you add a comment saying its possible to retry to add 
this to hash, but we don't do that yet. Also, collect()
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436 
Ah.. right. Than better not to add transient, otherwise will cause confusion.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 I see. 
Can you add comment here that further optimization like this is possible here. 
We can do this later.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 OK.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 I see. 
Missed this difference of duplicates. My major motivation in trying to unifying 
these two is so that in optimization rule we need not to worry about detecting 
whether there is a GBY in mapper or not.
  If using TreeSet will not over complicate this, I will suggest that. But if 
that complicates things too much here, I am fine with current implementation as 
well. I will let you decide on that.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 Here, 
using Set is definitely cleaner (and memory efficient) as far as I can see.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46 
Thanks.
  ql/src/test/queries/clientpositive/limit_pushdown.q:14 Thanks. Also, as I 
said in comment add
  select value,avg(key) from src group by value order by value limit 20;
  with reduce sink dedup optimization on.
  ql/src/test/queries/clientpositive/limit_pushdown.q:43 Thanks.

REVISION DETAIL
  https://reviews.facebook.net/D5967

BRANCH
  HIVE-3562

ARCANIST PROJECT
  hive

To: JIRA, tarball, ashutoshc, navis
Cc: njain

                
> Some limit can be pushed down to map stage
> ------------------------------------------
>
>                 Key: HIVE-3562
>                 URL: https://issues.apache.org/jira/browse/HIVE-3562
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Trivial
>         Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, 
> HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, 
> HIVE-3562.D5967.6.patch
>
>
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree, 
> TS-SEL-RS-EXT-LIMIT-FS
> But LIMIT can be partially calculated in RS, reducing size of shuffling.
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to