[jira] [Commented] (HIVE-5657) TopN produces incorrect results with count(distinct)

Phabricator (JIRA) Wed, 30 Oct 2013 17:51:53 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809811#comment-13809811
 ]


Phabricator commented on HIVE-5657:
-----------------------------------

navis has commented on the revision "HIVE-5657 [jira] TopN produces incorrect 
results with count(distinct)".

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:268 Right. 
it should be -1. I did mistake doing some refactoring.
  ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java:255 For distinct, it 
does not store values. Check the key and decide to forward all or exclude all. 
I'm not sure that the previous version was better. In this time, I've focused 
simplifying the flow of RS-op.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:255 Yes 
right. Previously, the key was like this
  [distributeKey:distinctKey1]
  [distributeKey:distinctKey2]
  and each row is serialized in whole by OI
  structOI[structOI(distributeKey):UnionOI(distinctKey)]

  Now the key is prepared like this and
  [distributeKey]
  [distinctKey1,distinctKey2]

  serialized for each part directly by inner OI : structOI(distributeKey) and 
UnionOI(distinctKey)

  I'm not feel good introducing new interface KeySerializer. But serializing 
distributeKey multiple time seemed worse than that.
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:125 
yes.
  ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java:211 Changed 
the name because it was confusing that RS is for MapAggr GBY, which is not.
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:243 I 
didn't know there was VectorReduceSinkOperator when I've started this, which 
made me include more refactorings than just amount of fixing the problem. I 
think current version of patch is way simpler than that of original. But if it 
makes merging of vectorization hard, I might create minimal patch just for fix.

REVISION DETAIL
  https://reviews.facebook.net/D13797

To: JIRA, navis
Cc: sershe


> TopN produces incorrect results with count(distinct)
> ----------------------------------------------------
>
>                 Key: HIVE-5657
>                 URL: https://issues.apache.org/jira/browse/HIVE-5657
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Navis
>            Priority: Critical
>         Attachments: D13797.1.patch, example.patch, HIVE-5657.1.patch.txt
>
>
> Attached patch illustrates the problem.
> limit_pushdown test has various other cases of aggregations and distincts, 
> incl. count-distinct, that work correctly (that said, src dataset is bad for 
> testing these things because every count, for example, produces one record 
> only), so something must be special about this.
> I am not very familiar with distinct- code and these nuances; if someone 
> knows a quick fix feel free to take this, otherwise I will probably start 
> looking next week. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (HIVE-5657) TopN produces incorrect results with count(distinct)

Reply via email to