[jira] [Commented] (HIVE-5657) TopN produces incorrect results with count(distinct)

Phabricator (JIRA) Wed, 30 Oct 2013 17:13:53 -0700

    [ https://issues.apache.org/jira/browse/HIVE-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809784#comment-13809784 ]


Phabricator commented on HIVE-5657:
-----------------------------------

sershe has commented on the revision "HIVE-5657 [jira] TopN produces incorrect 
results with count(distinct)".

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:125
    so this now supports any number of distincts?
  ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java:255
    right now this only returns forward... is this by design?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:243
    should all of this also be done for vectorized path?
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:268
    I fixed it in my patch for vectorized... why is hash needed here?
    If row is excluded we don't need hash, it's only needed when we store the
    value or collect
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:297
    if index >= 0 this should store value
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:255
    Previously there was just key, which was some columns and optionally one
    distinct.
    Do I read correctly that distribution key is now the same, just without
    distinct?
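
The comments on ReduceSinkOperator.java:268 and :297 turn on when the hash is
actually needed. A minimal sketch of that idea follows, under stated
assumptions: LazyHashTopN, tryStoreKey, process, EXCLUDED and hashComputations
are hypothetical names made up for illustration, not the TopNHash or
ReduceSinkOperator API, and plain String keys in natural order stand in for the
serialized reduce keys. The key is tested against the top-N set first; the hash
is computed only when the returned index is >= 0 and the value is stored.

```java
import java.util.TreeMap;

/** Illustrative only: keeps the N smallest keys and hashes a key lazily. */
final class LazyHashTopN {
    /** Hypothetical sentinel: the row falls outside the current top N. */
    static final int EXCLUDED = -1;

    private final int limit;
    // Sorted map from key to the slot holding its value and hash.
    private final TreeMap<String, Integer> slotByKey = new TreeMap<>();
    private final String[] values;
    private final int[] hashes;
    /** Counts hash computations so the saving is observable. */
    int hashComputations = 0;

    LazyHashTopN(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
        this.values = new String[limit];
        this.hashes = new int[limit];
    }

    /** Returns a slot index >= 0 if the key is kept, EXCLUDED otherwise. */
    int tryStoreKey(String key) {
        Integer existing = slotByKey.get(key);
        if (existing != null) {
            return existing;                 // key already tracked
        }
        if (slotByKey.size() < limit) {
            int slot = slotByKey.size();     // still room: take the next free slot
            slotByKey.put(key, slot);
            return slot;
        }
        String largest = slotByKey.lastKey();
        if (key.compareTo(largest) > 0) {
            return EXCLUDED;                 // worse than everything kept so far
        }
        int slot = slotByKey.remove(largest); // evict the largest, reuse its slot
        slotByKey.put(key, slot);
        return slot;
    }

    /** Processes one row; returns true if its value was stored. */
    boolean process(String key, String value) {
        int index = tryStoreKey(key);
        if (index == EXCLUDED) {
            return false;                    // excluded row: no hash computed
        }
        hashes[index] = computeHash(key);    // index >= 0: hash and store value
        values[index] = value;
        return true;
    }

    private int computeHash(String key) {
        hashComputations++;
        return key.hashCode();
    }
}
```

With a limit of 2, feeding keys "b", "c", "d", "a" in that order stores three
rows and drops "d" without ever hashing it.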

REVISION DETAIL
  https://reviews.facebook.net/D13797

To: JIRA, navis
Cc: sershe


> TopN produces incorrect results with count(distinct)
> ----------------------------------------------------
>
>                 Key: HIVE-5657
>                 URL: https://issues.apache.org/jira/browse/HIVE-5657
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Navis
>            Priority: Critical
>         Attachments: D13797.1.patch, example.patch, HIVE-5657.1.patch.txt
>
>
> Attached patch illustrates the problem.
> limit_pushdown test has various other cases of aggregations and distincts, 
> incl. count-distinct, that work correctly (that said, src dataset is bad for 
> testing these things because every count, for example, produces one record 
> only), so something must be special about this.
> I am not very familiar with the distinct code and these nuances; if someone 
> knows a quick fix, feel free to take this; otherwise I will probably start 
> looking next week. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)
