[
https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749306#comment-13749306
]
Phabricator commented on HIVE-3562:
-----------------------------------
ashutoshc has commented on the revision "HIVE-3562 [jira] Some limit can be
pushed down to map stage".
Good stuff!
INLINE COMMENTS
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:528 Yeah I know
historically we have done that way. But in my experience having more configs
always confuses users instead of helping.
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java:1187 Looks
good. Can you update the patch with this?
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:157 Sounds
good. Please add it in comment.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:325 Sounds
good.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:343 I see.
I think just forwarding in this case is simple and better thing to do. So, lets
leave it that way. Can you add a comment saying its possible to retry to add
this to hash, but we don't do that yet. Also, collect()
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:434-436
Ah.. right. Than better not to add transient, otherwise will cause confusion.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:477 I see.
Can you add comment here that further optimization like this is possible here.
We can do this later.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:495 OK.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:535 I see.
Missed this difference of duplicates. My major motivation in trying to unifying
these two is so that in optimization rule we need not to worry about detecting
whether there is a GBY in mapper or not.
If using TreeSet will not over complicate this, I will suggest that. But if
that complicates things too much here, I am fine with current implementation as
well. I will let you decide on that.
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:566 Here,
using Set is definitely cleaner (and memory efficient) as far as I can see.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:46
Thanks.
ql/src/test/queries/clientpositive/limit_pushdown.q:14 Thanks. Also, as I
said in comment add
select value,avg(key) from src group by value order by value limit 20;
with reduce sink dedup optimization on.
ql/src/test/queries/clientpositive/limit_pushdown.q:43 Thanks.
REVISION DETAIL
https://reviews.facebook.net/D5967
BRANCH
HIVE-3562
ARCANIST PROJECT
hive
To: JIRA, tarball, ashutoshc, navis
Cc: njain
> Some limit can be pushed down to map stage
> ------------------------------------------
>
> Key: HIVE-3562
> URL: https://issues.apache.org/jira/browse/HIVE-3562
> Project: Hive
> Issue Type: Bug
> Reporter: Navis
> Assignee: Navis
> Priority: Trivial
> Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch,
> HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch,
> HIVE-3562.D5967.6.patch
>
>
> Queries with limit clause (with reasonable number), for example
> {noformat}
> select * from src order by key limit 10;
> {noformat}
> makes operator tree,
> TS-SEL-RS-EXT-LIMIT-FS
> But LIMIT can be partially calculated in RS, reducing size of shuffling.
> TS-SEL-RS(TOP-N)-EXT-LIMIT-FS
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira