[ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886339#action_12886339 ]
Arvind Prabhakar commented on HIVE-287: --------------------------------------- @Zheng: Welcome to the party. bq. Why do we put the DISTINCT in the information? DISTINCT is currently done by the framework, instead of individual UDAF. This is good because the logic of removing duplicates are common for all UDAFs. We do support SUM(DISTINCT val). Providing the information in the parameter specification is not the same as enforcing its interpretation. This is provided primarily to ensure that UDAFs that rely on this information can make appropriate decisions. For example, we wanted to disallow the invocation {{COUNT( EXPR1, EXPR2 ...)}} in favor of {{COUNT(*DISTINCT* EXPR1, EXPR2 ...)}}. Without this information, the count UDAF will not be able to enforce the later syntax. bq. Why do we special-case ""? It seems to me that "" is just a short-cut. Hive already supports regex-based multi-column specification, so that we can say `abc.*` for all columns with name starting with abc. The compiler should just expand * and give all the columns to the UDAF. If you wish to use \* as a regular expression, you would have to quote it as a string - {{COUNT('\*')}}. This is different from the invocation as specified in SQL which treats \* as a terminal symbol. So if it is OK to deviate from the standard representation, the user can easily use the quoted string representation to achieve the effect similar to {{COUNT(col1, col2 ..)}}. The semantics of this should be more like {{COUNT(DISTINCT EXPR1, EXPR2 ...)}} as opposed to {{COUNT(\*)}}. bq. Since COUNT(\*) is a special-case in the SQL standard (COUNT(\*) is different from COUNT(col) even if the table has a single column col), I think we should just special-case that and replace that with count(1) at some place. Are you suggesting that we allow the grammar to express {{COUNT(\*)}} syntax, but in the lexical analysis stage turn it into a {{COUNT(1)}}? I can see how that may work - but personally I am not a fan of such an approach. > count distinct on multiple columns does not work > ------------------------------------------------ > > Key: HIVE-287 > URL: https://issues.apache.org/jira/browse/HIVE-287 > Project: Hadoop Hive > Issue Type: Bug > Components: Query Processor > Reporter: Namit Jain > Assignee: Arvind Prabhakar > Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, > HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch > > > The following query does not work: > select count(distinct col1, col2) from Tbl -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.