[ https://issues.apache.org/jira/browse/HIVE-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809811#comment-13809811 ]
Phabricator commented on HIVE-5657: ----------------------------------- navis has commented on the revision "HIVE-5657 [jira] TopN produces incorrect results with count(distinct)". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:268 Right. it should be -1. I did mistake doing some refactoring. ql/src/java/org/apache/hadoop/hive/ql/exec/TopNHash.java:255 For distinct, it does not store values. Check the key and decide to forward all or exclude all. I'm not sure that the previous version was better. In this time, I've focused simplifying the flow of RS-op. ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:255 Yes right. Previously, the key was like this [distributeKey:distinctKey1] [distributeKey:distinctKey2] and each row is serialized in whole by OI structOI[structOI(distributeKey):UnionOI(distinctKey)] Now the key is prepared like this and [distributeKey] [distinctKey1,distinctKey2] serialized for each part directly by inner OI : structOI(distributeKey) and UnionOI(distinctKey) I'm not feel good introducing new interface KeySerializer. But serializing distributeKey multiple time seemed worse than that. ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:125 yes. ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java:211 Changed the name because it was confusing that RS is for MapAggr GBY, which is not. ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java:243 I didn't know there was VectorReduceSinkOperator when I've started this, which made me include more refactorings than just amount of fixing the problem. I think current version of patch is way simpler than that of original. But if it makes merging of vectorization hard, I might create minimal patch just for fix. REVISION DETAIL https://reviews.facebook.net/D13797 To: JIRA, navis Cc: sershe > TopN produces incorrect results with count(distinct) > ---------------------------------------------------- > > Key: HIVE-5657 > URL: https://issues.apache.org/jira/browse/HIVE-5657 > Project: Hive > Issue Type: Bug > Reporter: Sergey Shelukhin > Assignee: Navis > Priority: Critical > Attachments: D13797.1.patch, example.patch, HIVE-5657.1.patch.txt > > > Attached patch illustrates the problem. > limit_pushdown test has various other cases of aggregations and distincts, > incl. count-distinct, that work correctly (that said, src dataset is bad for > testing these things because every count, for example, produces one record > only), so something must be special about this. > I am not very familiar with distinct- code and these nuances; if someone > knows a quick fix feel free to take this, otherwise I will probably start > looking next week. -- This message was sent by Atlassian JIRA (v6.1#6144)