[
https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879994#action_12879994
]
John Sichi commented on HIVE-287:
---------------------------------
For DISTINCT: we can check the function invocation itself (during semantic
analysis) by calling supportsDistinct() immediately after instantiating the
GenericUDAFEvaluator in SemanticAnalyzer. This allows strict validation to be
performed. Or make the method name checkDistinct and allow the UDAF to throw
the exception itself. But I agree that in this case it would be cleaner to
extend the interface, so I'm fine if we go ahead with that in a non-breaking
fashion.
For COUNT(*): if you think about it, COUNT(*) really means "ignore all
columns" not "count all columns". So I think an empty array actually makes a
lot of sense here. Can you think of a case where UDAF(*) even makes sense,
where UDAF != COUNT? If you don't have access to any per-row data, what can
you do other than count it? I'd say we should actually disallow * for anything
but COUNT, per the SQL standard.
I like your approach to keeping compatibility via instanceof, so if the
decision ends up being to add the extra parameters, then we should definitely
use that approach. However, extension points should always be interfaces (not
abstract classes) to allow for stuff like dynamic proxies. So we would need to
add a new interface GenericUDAFResolver2 (extends GenericUDAFResolver) with the
new method, and make AbstractGenericUDAFResolver implement both.
Interface evolution is never pretty, but there is an interface design pattern
which avoids this particular problem. Imagine if originally we had defined a
GenericUDAFResolverInput class inside of Hive itself, with a method
getParameters() returning TypeInfo []. HIve would instantiate this and pass an
input object into getEvaluator, and the evaluator would call
input.getParameters(). This would have allowed us to add a boolean
isDistinct() method to GenericUDAFResolverInput without breaking anything
(source or binary) and without needing to add a new interface; old plugins
would not know about isDistinct() so they wouldn't call it, and new ones could.
I would argue that if we're going to go to the trouble of adding
GenericUDAFResolver2, then we should build the pattern above into it as well in
case we need further evolution later on.
p.s. I'm really glad you're working on this one...every few days I try a
count(*) against Hive accidentally and then kick myself.
> count distinct on multiple columns does not work
> ------------------------------------------------
>
> Key: HIVE-287
> URL: https://issues.apache.org/jira/browse/HIVE-287
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Reporter: Namit Jain
> Assignee: Arvind Prabhakar
> Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.