[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

John Sichi (JIRA) Thu, 17 Jun 2010 16:15:53 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879994#action_12879994
 ]


John Sichi commented on HIVE-287:
---------------------------------

For DISTINCT: we can check the function invocation itself (during semantic 
analysis) by calling supportsDistinct() immediately after instantiating the 
GenericUDAFEvaluator in SemanticAnalyzer.  This allows strict validation to be 
performed.  Alternatively, we could name the method checkDistinct and let the 
UDAF throw the exception itself.  But I agree that in this case it would be 
cleaner to extend the interface, so I'm fine if we go ahead with that in a 
non-breaking fashion.
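
A minimal sketch of the supportsDistinct() check described above. The types 
here are stand-ins so the example compiles on its own; they are not the real 
Hive classes, and the exception name and message are assumptions.

```java
// Stand-in for the evaluator type; NOT the real Hive GenericUDAFEvaluator.
interface Evaluator {
    // Hypothetical method from the discussion: does this aggregate accept DISTINCT?
    boolean supportsDistinct();
}

// Hypothetical analysis-time error type (assumed to be a checked exception).
class SemanticException extends Exception {
    SemanticException(String message) {
        super(message);
    }
}

final class DistinctCheck {
    // Intended to run right after the evaluator is instantiated, so an
    // unsupported DISTINCT is rejected during semantic analysis.
    static void validate(String functionName, Evaluator evaluator, boolean isDistinct)
            throws SemanticException {
        if (isDistinct && !evaluator.supportsDistinct()) {
            throw new SemanticException("DISTINCT is not supported for " + functionName);
        }
    }
}
```

The checkDistinct variant would move the throw into the UDAF itself, letting 
each function word its own error message.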

For COUNT(*):  if you think about it, COUNT(*) really means "ignore all 
columns" not "count all columns".  So I think an empty array actually makes a 
lot of sense here. Can you think of a case where UDAF(*) even makes sense, 
where UDAF != COUNT?  If you don't have access to any per-row data, what can 
you do other than count it?  I'd say we should actually disallow * for anything 
but COUNT, per the SQL standard.
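
One way to picture the COUNT(*) rule above: the star turns into an empty 
parameter array for COUNT and is rejected for everything else. This is only a 
sketch; the class name, the use of plain strings for type names, and the 
exception type are all assumptions made for illustration.

```java
final class StarArgumentCheck {
    // Returns the parameter types the aggregate should see.
    // For COUNT(*) that is an empty array: "ignore all columns".
    static String[] parameterTypes(String functionName, boolean isStar, String[] argTypes) {
        if (isStar) {
            if (!"count".equalsIgnoreCase(functionName)) {
                // No per-row data is available, so only counting makes sense.
                throw new IllegalArgumentException(
                        "* is only allowed as an argument to COUNT, not " + functionName);
            }
            return new String[0];
        }
        // Ordinary call: pass the argument types through unchanged.
        return argTypes.clone();
    }
}
```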

I like your approach to keeping compatibility via instanceof, so if the 
decision ends up being to add the extra parameters, then we should definitely 
use that approach.  However, extension points should always be interfaces (not 
abstract classes) to allow for stuff like dynamic proxies.  So we would need to 
add a new interface GenericUDAFResolver2 (extends GenericUDAFResolver) with the 
new method, and make AbstractGenericUDAFResolver implement both.
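
A rough sketch of that arrangement, using stand-in types rather than the real 
Hive classes. The signature of the new method, the default delegation in the 
abstract class, and the ResolverDispatch helper are assumptions for 
illustration only.

```java
// Stand-in types so the sketch compiles on its own; NOT the real Hive classes.
final class TypeInfo {
}

final class Evaluator {
    final String description;

    Evaluator(String description) {
        this.description = description;
    }
}

// The existing extension point, left untouched so old plugins keep working.
interface GenericUDAFResolver {
    Evaluator getEvaluator(TypeInfo[] parameters);
}

// The new extension point is an interface that extends the old one.
interface GenericUDAFResolver2 extends GenericUDAFResolver {
    // Hypothetical signature for the extra parameters.
    Evaluator getEvaluator(TypeInfo[] parameters, boolean isDistinct, boolean isAllColumns);
}

// Convenience base class implementing both interfaces.
abstract class AbstractGenericUDAFResolver implements GenericUDAFResolver, GenericUDAFResolver2 {
    // Assumed default: ignore the new flags and fall back to the old method.
    @Override
    public Evaluator getEvaluator(TypeInfo[] parameters, boolean isDistinct, boolean isAllColumns) {
        return getEvaluator(parameters);
    }
}

final class ResolverDispatch {
    // Caller side: use the richer method only when the plugin offers it.
    static Evaluator resolve(GenericUDAFResolver resolver, TypeInfo[] parameters,
                             boolean isDistinct, boolean isAllColumns) {
        if (resolver instanceof GenericUDAFResolver2) {
            return ((GenericUDAFResolver2) resolver)
                    .getEvaluator(parameters, isDistinct, isAllColumns);
        }
        return resolver.getEvaluator(parameters);
    }
}
```

Because both extension points are interfaces, a plugin (or a dynamic proxy) can 
implement GenericUDAFResolver2 directly without inheriting from the abstract 
class.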

Interface evolution is never pretty, but there is an interface design pattern 
which avoids this particular problem.  Imagine if originally we had defined a 
GenericUDAFResolverInput class inside of Hive itself, with a method 
getParameters() returning TypeInfo[].  Hive would instantiate this and pass an 
input object into getEvaluator, and the evaluator would call 
input.getParameters().  This would have allowed us to add a boolean 
isDistinct() method to GenericUDAFResolverInput without breaking anything 
(source or binary) and without needing to add a new interface; old plugins 
would not know about isDistinct() so they wouldn't call it, and new ones could.
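
The pattern described above might look roughly like this. All types are 
stand-ins written for the example, not the real Hive classes, and OldPlugin and 
NewPlugin are hypothetical names.

```java
// Stand-in types so the sketch compiles on its own; NOT the real Hive classes.
final class TypeInfo {
}

final class Evaluator {
    final String description;

    Evaluator(String description) {
        this.description = description;
    }
}

// Owned and instantiated by the host; plugins only read from it.
final class GenericUDAFResolverInput {
    private final TypeInfo[] parameters;
    private final boolean distinct;

    GenericUDAFResolverInput(TypeInfo[] parameters, boolean distinct) {
        this.parameters = parameters.clone();
        this.distinct = distinct;
    }

    // Present from the first release.
    TypeInfo[] getParameters() {
        return parameters.clone();
    }

    // Added in a later release; existing plugins never call it, so nothing breaks.
    boolean isDistinct() {
        return distinct;
    }
}

// The plugin interface never has to change when the input object grows.
interface GenericUDAFResolver {
    Evaluator getEvaluator(GenericUDAFResolverInput input);
}

// Plugin written before isDistinct() existed.
final class OldPlugin implements GenericUDAFResolver {
    public Evaluator getEvaluator(GenericUDAFResolverInput input) {
        return new Evaluator("args=" + input.getParameters().length);
    }
}

// Plugin written afterwards, which takes advantage of the new accessor.
final class NewPlugin implements GenericUDAFResolver {
    public Evaluator getEvaluator(GenericUDAFResolverInput input) {
        String prefix = input.isDistinct() ? "distinct " : "";
        return new Evaluator(prefix + "args=" + input.getParameters().length);
    }
}
```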

I would argue that if we're going to go to the trouble of adding 
GenericUDAFResolver2, then we should build the pattern above into it as well in 
case we need further evolution later on.

p.s. I'm really glad you're working on this one...every few days I try a 
count(*) against Hive accidentally and then kick myself.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
