[ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884580#action_12884580 ]
Arvind Prabhakar commented on HIVE-287: --------------------------------------- Thanks for the explanation John. The SQL BNF that you pointed out is the normative SQL specification. I do not think any SQL implementations use this grammar though. The parallel is that of an interface and its implementation. While the interface can be short and precise, the implementations may choose to delegate interface methods to other implementation specific methods. Similarly, most databases deal with their own SQL grammar that is compliant with the SQL standard at specific levels. More to the point in Hive - my key concern is that by modifying the grammar to make an exception for {{COUNT}}, we will be introducing a brittle coupling between the the parser and semantic analyzer. Right now the count aggregate function is treated like any other function and is thus part of the general framework. By making this change, we will be modifying it to be specifically associated from with the grammar directives. This is the current function definition in Hive QL grammar (*A*): {noformat} -->[ functionName ]-->[ LPAREN ]--+-->[ KW_DISTINCT ]--+--+--+-->[ expression ]--+--+-->[ RPAREN ]--> | | | | | | +----------->--------+ | +------[ COMMA ]<---+ | | | +------------->-----------+ {noformat} The patch that I have supplied already on this Jira modifies this definition as follows (*B*): {noformat} -->[ functionName ]-->[ LPAREN ]--+------------------------>[ STAR ]----------------------+-->[ RPAREN ]--> | | +--+-->[ KW_DISTINCT ]--+--+--+-->[ expression ]--+--+--+ | | | | | | +----------->--------+ | +------[ COMMA ]<---+ | | | +------------->-----------+ {noformat} If I were to modify the grammar to make an exception for {{COUNT}} it will likely be changed to something like this (*C*): {noformat} --+---------------------------------->[ KW_COUNT ]-->[ LPAREN ]-->[ STAR ]-->[ RPAREN ]----------------+--> | | +-->[ functionName ]-->[ LPAREN ]--+--+-->[ KW_DISTINCT ]--+--+-->[ expression ]--+--+-->[ RPAREN ]--+ | | | | | | | +----------->--------+ +----[ COMMA ]<-----+ | | | +------------------------->-----------------------+ {noformat} Consider the *C* approach closely: The production that matches a {{COUNT}} invocation can be directly matched via the top branch using {{KW_COUNT}} token, or it could follow the branch below where {{functionName}} could match {{COUNT}}. On the semantic analyzer side, it makes the matching logic more complex and less intuitive since now the {{COUNT}} can be invoked via two branch conditions. For example - there would be one invocation that would directly delegate to the {{COUNT}} aggregate function, whereas another that will use the current resolver mechanism to invoke it. Instead, the approach *B* keeps the grammar consistent with the regular function invocation. It does not favor any one function over the other and simply establishes matching rules for function production. That way, the call is then delegated to the semantic analyzer which in turn matches the appropriate handling function based on the name and parameter type using the generic resolver mechanism without regard to what function is being invoked. The changes supplied in the current patch also allow individual function handlers to decide if they would like to support the {{functionName(STAR)}} syntax. Since you feel strongly about not supporting this syntax by default for any function, I can perhaps modify the {{AbstractGenericUDAFResolver}} class to raise an exception if invoked with this syntax. That way, only the functions that choose to overwrite that behavior will be able to support it. Also as you can see in the syntax diagram for (B), there is no production that will match things like {{functionName(DISTINCT STAR)}} or {{functionName(STAR, EXPR)}} etc. That said, I am open to going either way. I just wanted to put my point across so that you could consider it and make an informed decision. Let me know what you think. > count distinct on multiple columns does not work > ------------------------------------------------ > > Key: HIVE-287 > URL: https://issues.apache.org/jira/browse/HIVE-287 > Project: Hadoop Hive > Issue Type: Bug > Components: Query Processor > Reporter: Namit Jain > Assignee: Arvind Prabhakar > Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, > HIVE-287-4.patch > > > The following query does not work: > select count(distinct col1, col2) from Tbl -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.