[ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879994#action_12879994 ]
John Sichi commented on HIVE-287:
---------------------------------

For DISTINCT: we can check the function invocation itself (during semantic analysis) by calling supportsDistinct() immediately after instantiating the GenericUDAFEvaluator in SemanticAnalyzer. This allows strict validation to be performed. Or make the method name checkDistinct and allow the UDAF to throw the exception itself. But I agree that in this case it would be cleaner to extend the interface, so I'm fine if we go ahead with that in a non-breaking fashion.

For COUNT(*): if you think about it, COUNT(*) really means "ignore all columns" not "count all columns". So I think an empty array actually makes a lot of sense here. Can you think of a case where UDAF(*) even makes sense, where UDAF != COUNT? If you don't have access to any per-row data, what can you do other than count it? I'd say we should actually disallow * for anything but COUNT, per the SQL standard.

I like your approach to keeping compatibility via instanceof, so if the decision ends up being to add the extra parameters, then we should definitely use that approach. However, extension points should always be interfaces (not abstract classes) to allow for stuff like dynamic proxies. So we would need to add a new interface GenericUDAFResolver2 (extends GenericUDAFResolver) with the new method, and make AbstractGenericUDAFResolver implement both.

Interface evolution is never pretty, but there is an interface design pattern which avoids this particular problem. Imagine if originally we had defined a GenericUDAFResolverInput class inside of Hive itself, with a method getParameters() returning TypeInfo[]. Hive would instantiate this and pass an input object into getEvaluator, and the evaluator would call input.getParameters().
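
A rough Java sketch of the input-object pattern and the instanceof-based fallback follows. Every name in it (Resolver, Resolver2, ResolverInput, AbstractResolver, ResolverSketch) is a simplified stand-in invented for illustration, not the actual Hive classes, and parameter types are plain strings rather than TypeInfo:

```java
// Simplified stand-ins for illustration only; these are NOT the real Hive types.
interface Evaluator {
    String describe();
}

// Original extension point: parameters are passed directly, so the method
// signature cannot grow without breaking existing plugins.
interface Resolver {
    Evaluator getEvaluator(String[] parameterTypes);
}

// Hypothetical input object owned by the framework. Accessors can be added
// here later; plugins that do not know about them never call them.
interface ResolverInput {
    String[] getParameters();
    boolean isDistinct();
}

// New extension point. It stays an interface so dynamic proxies still work.
interface Resolver2 extends Resolver {
    Evaluator getEvaluator(ResolverInput input);
}

// Convenience base class implementing both interfaces.
abstract class AbstractResolver implements Resolver2 {
    @Override
    public Evaluator getEvaluator(ResolverInput input) {
        // Default behaviour: ignore the extra information, use the old method.
        return getEvaluator(input.getParameters());
    }
}

// Framework-side implementation of the input object.
final class SimpleInput implements ResolverInput {
    private final String[] parameters;
    private final boolean distinct;

    SimpleInput(String[] parameters, boolean distinct) {
        this.parameters = parameters.clone();
        this.distinct = distinct;
    }

    @Override public String[] getParameters() { return parameters.clone(); }
    @Override public boolean isDistinct() { return distinct; }
}

// A new-style plugin that looks at the extra information.
final class DistinctAwareResolver extends AbstractResolver {
    @Override
    public Evaluator getEvaluator(String[] parameterTypes) {
        return () -> "new:" + parameterTypes.length + ":distinct=false";
    }

    @Override
    public Evaluator getEvaluator(ResolverInput input) {
        return () -> "new:" + input.getParameters().length
                + ":distinct=" + input.isDistinct();
    }
}

final class ResolverSketch {
    // Framework-side dispatch: new-style plugins receive the input object,
    // old-style plugins keep receiving just the parameter types.
    static Evaluator resolve(Resolver resolver, ResolverInput input) {
        if (resolver instanceof Resolver2) {
            return ((Resolver2) resolver).getEvaluator(input);
        }
        return resolver.getEvaluator(input.getParameters());
    }

    public static void main(String[] args) {
        ResolverInput input = new SimpleInput(new String[] {"int", "string"}, true);
        Resolver oldStyle = params -> () -> "old:" + params.length;
        System.out.println(resolve(oldStyle, input).describe());
        System.out.println(resolve(new DistinctAwareResolver(), input).describe());
    }
}
```

The old-style resolver compiles and runs unchanged against the dispatch method, while the new-style resolver can read whatever accessors the input object exposes.
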
This would have allowed us to add a boolean isDistinct() method to GenericUDAFResolverInput without breaking anything (source or binary) and without needing to add a new interface; old plugins would not know about isDistinct() so they wouldn't call it, and new ones could.

I would argue that if we're going to go to the trouble of adding GenericUDAFResolver2, then we should build the pattern above into it as well in case we need further evolution later on.

p.s. I'm really glad you're working on this one...every few days I try a count(*) against Hive accidentally and then kick myself.

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.