[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Arvind Prabhakar (JIRA) Thu, 17 Jun 2010 15:34:48 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879983#action_12879983
 ]


Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@John: Thanks for reviewing this change. I have some follow-up comments and 
suggestions:

bq. isDistinct: this doesn't actually modify the choice of evaluator 
implementation at all, since the actual duplicate elimination takes place 
upstream of the UDAF invocation. So instead of adding this parameter, can we 
instead add a new method supportsDistinct() on GenericUDAFEvaluator? 

While the duplicate elimination may happen upstream, I was concerned that this 
does not rule out cases where the information is relevant to the function 
invocation itself. For example, the implementation of {{count}} requires that 
if there is a valid argument list, it must be qualified with {{DISTINCT}}.

bq. isAllColumns: COUNT is probably the only function which is ever even going 
to care about this one. Couldn't we just use an empty array of TypeInfo to 
indicate all columns?

I had a similar idea, but after some consideration opted for a simpler design. 
I felt that overloading arguments to indicate special cases might lead to 
confusion, and to eventual problems when a use case emerges that invalidates 
this assumption. 

I do agree with your point that it would be good to stay compatible if 
possible. One way to do it would be as follows:

# Revert {{GenericUDAFResolver}} to its previous state, but deprecate the 
interface in favor of the abstract base class.
# Push the newly introduced method into the {{AbstractGenericUDAFResolver}} 
implementation.
# Modify the {{FunctionRegistry.getGenericUDAFEvaluator()}} method to test 
whether the resolver instance is type-compatible with 
{{AbstractGenericUDAFResolver}} and, if so, invoke the new method. Otherwise 
fall back to the old mechanism.
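
To make the three steps concrete, here is a rough sketch of the dispatch. The 
types below are simplified stand-ins, not the actual Hive classes: evaluators 
are represented as plain strings, and the three-argument {{getEvaluator}} 
signature, {{LegacyResolver}}, and {{FlagAwareResolver}} are assumptions made 
up for illustration only.

```java
import java.util.Arrays;
import java.util.List;

final class ResolverDispatchSketch {

    /** Stand-in for the original resolver interface (step 1: reverted, then deprecated). */
    @Deprecated
    interface GenericUDAFResolver {
        String getEvaluator(List<String> argTypes);
    }

    /** Stand-in for the abstract base class that carries the new method (step 2). */
    abstract static class AbstractGenericUDAFResolver implements GenericUDAFResolver {
        // Assumed signature. The default delegates to the old method, so
        // resolvers that do not care about the flags need no changes.
        String getEvaluator(List<String> argTypes, boolean isDistinct, boolean isAllColumns) {
            return getEvaluator(argTypes);
        }
    }

    /** Hypothetical resolver written against the old interface only. */
    static final class LegacyResolver implements GenericUDAFResolver {
        @Override
        public String getEvaluator(List<String> argTypes) {
            return "legacy" + argTypes;
        }
    }

    /** Hypothetical resolver that uses the extra information. */
    static final class FlagAwareResolver extends AbstractGenericUDAFResolver {
        @Override
        public String getEvaluator(List<String> argTypes) {
            return getEvaluator(argTypes, false, false);
        }

        @Override
        String getEvaluator(List<String> argTypes, boolean isDistinct, boolean isAllColumns) {
            return "flagAware" + argTypes
                    + " distinct=" + isDistinct
                    + " allColumns=" + isAllColumns;
        }
    }

    /** Step 3: invoke the new method when the resolver supports it, else fall back. */
    static String getGenericUDAFEvaluator(GenericUDAFResolver resolver,
                                          List<String> argTypes,
                                          boolean isDistinct,
                                          boolean isAllColumns) {
        if (resolver instanceof AbstractGenericUDAFResolver) {
            return ((AbstractGenericUDAFResolver) resolver)
                    .getEvaluator(argTypes, isDistinct, isAllColumns);
        }
        // Old mechanism: the flags are simply not passed along.
        return resolver.getEvaluator(argTypes);
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("string", "int");
        System.out.println(getGenericUDAFEvaluator(new LegacyResolver(), types, true, false));
        System.out.println(getGenericUDAFEvaluator(new FlagAwareResolver(), types, true, false));
    }
}
```

With this shape, existing resolvers that implement only the interface keep 
working unchanged, while resolvers that extend the base class receive the 
extra flags.
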

What do you think about this approach?


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
