[ 
https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886339#action_12886339
 ] 

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@Zheng: Welcome to the party.

bq. Why do we put the DISTINCT in the information? DISTINCT is currently done 
by the framework, instead of individual UDAF. This is good because the logic of 
removing duplicates are common for all UDAFs. We do support SUM(DISTINCT val).

Providing the information in the parameter specification is not the same as 
enforcing its interpretation. This is provided primarily to ensure that UDAFs 
that rely on this information can make appropriate decisions. For example, we 
wanted to disallow the invocation {{COUNT( EXPR1, EXPR2 ...)}} in favor of 
{{COUNT(*DISTINCT* EXPR1, EXPR2 ...)}}. Without this information, the count 
UDAF will not be able to enforce the latter syntax.
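
To make the point concrete, here is a minimal sketch of the kind of check the count UDAF could perform once the DISTINCT flag is visible to it. The class and method names ({{ParamInfo}}, {{CountArgCheck.validate}}) are hypothetical and used only for illustration; they are not taken from the patch or from Hive's API.

```java
import java.util.List;

// Hypothetical stand-in for the parameter specification handed to a UDAF;
// the names are illustrative only and do not come from the patch.
final class ParamInfo {
    final List<String> argExprs;  // argument expressions as written in the query
    final boolean distinct;       // true if the call used the DISTINCT qualifier

    ParamInfo(List<String> argExprs, boolean distinct) {
        this.argExprs = argExprs;
        this.distinct = distinct;
    }
}

final class CountArgCheck {
    // Returns null if the invocation is acceptable, otherwise an error message.
    static String validate(ParamInfo info) {
        // COUNT(EXPR1, EXPR2 ...) without DISTINCT is the form to be rejected.
        if (info.argExprs.size() > 1 && !info.distinct) {
            return "COUNT with multiple arguments requires DISTINCT";
        }
        return null;
    }
}
```

The framework can still do the actual duplicate removal; the flag only lets the UDAF decide whether the invocation is legal.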

bq. Why do we special-case "\*"? It seems to me that "\*" is just a short-cut. Hive 
already supports regex-based multi-column specification, so that we can say 
`abc.*` for all columns with name starting with abc. The compiler should just 
expand * and give all the columns to the UDAF.

If you wish to use \* as a regular expression, you would have to quote it as a 
string - {{COUNT('\*')}}. This is different from the invocation as specified in 
SQL which treats \* as a terminal symbol. So if it is OK to deviate from the 
standard representation, the user can easily use the quoted string 
representation to achieve the effect similar to {{COUNT(col1, col2 ..)}}. The 
semantics of this should be more like {{COUNT(DISTINCT EXPR1, EXPR2 ...)}} as 
opposed to {{COUNT(\*)}}.

bq. Since COUNT(\*) is a special-case in the SQL standard (COUNT(\*) is 
different from COUNT(col) even if the table has a single column col), I think 
we should just special-case that and replace that with count(1) at some place.

Are you suggesting that we allow the grammar to express {{COUNT(\*)}} syntax, 
but in the lexical analysis stage turn it into {{COUNT(1)}}? I can see how 
that may work - but personally I am not a fan of such an approach. 
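
For reference, a rough sketch of what such a token-level rewrite might look like, assuming the query has already been split into a flat list of tokens. {{CountStarRewrite}} is a made-up name for illustration and does not correspond to anything in the Hive code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class CountStarRewrite {
    // Replaces the token sequence COUNT ( * ) with COUNT ( 1 ). Any other
    // use of '*' (for example multiplication) is left untouched.
    static List<String> rewrite(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        for (int i = 0; i + 3 < out.size(); i++) {
            if (out.get(i).equalsIgnoreCase("COUNT")
                    && out.get(i + 1).equals("(")
                    && out.get(i + 2).equals("*")
                    && out.get(i + 3).equals(")")) {
                out.set(i + 2, "1");
            }
        }
        return out;
    }
}
```

Even in this small form, the rewrite has to inspect the neighbouring tokens to tell {{COUNT(\*)}} apart from other uses of \*, which is part of why I would rather have the grammar handle it.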

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, 
> HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
