[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Arvind Prabhakar (JIRA) Fri, 02 Jul 2010 00:28:23 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884580#action_12884580
 ]


Arvind Prabhakar commented on HIVE-287:
---------------------------------------

Thanks for the explanation John. The SQL BNF that you pointed out is the 
normative SQL specification. I do not think any SQL implementations use this 
grammar though. The parallel is that of an interface and its implementation. 
While the interface can be short and precise, the implementations may choose to 
delegate interface methods to other implementation specific methods. Similarly, 
most databases deal with their own SQL grammar that is compliant with the SQL 
standard at specific levels.

More to the point in Hive - my key concern is that by modifying the grammar to 
make an exception for {{COUNT}}, we will be introducing a brittle coupling 
between the the parser and semantic analyzer. Right now the count aggregate 
function is treated like any other function and is thus part of the general 
framework. By making this change, we will be modifying it to be specifically 
associated from with the grammar directives. 

This is the current function definition in Hive QL grammar (*A*):

{noformat}
-->[ functionName ]-->[ LPAREN ]--+-->[ KW_DISTINCT ]--+--+--+-->[ expression 
]--+--+-->[ RPAREN ]-->
                                  |                    |  |  |                  
 |  |
                                  +----------->--------+  |  +------[ COMMA 
]<---+  |
                                                          |                     
    |
                                                          
+------------->-----------+
{noformat}

The patch that I have supplied already on this Jira modifies this definition as 
follows (*B*):

{noformat}

-->[ functionName ]-->[ LPAREN ]--+------------------------>[ STAR 
]----------------------+-->[ RPAREN ]-->
                                  |                                             
          |
                                  +--+-->[ KW_DISTINCT ]--+--+--+-->[ 
expression ]--+--+--+
                                     |                    |  |  |               
    |  |
                                     +----------->--------+  |  +------[ COMMA 
]<---+  |
                                                             |                  
       |
                                                             
+------------->-----------+
{noformat}

If I were to modify the grammar to make an exception for {{COUNT}} it will 
likely be changed to something like this (*C*):

{noformat}

--+---------------------------------->[ KW_COUNT ]-->[ LPAREN ]-->[ STAR ]-->[ 
RPAREN ]----------------+-->
  |                                                                             
                       |
  +-->[ functionName ]-->[ LPAREN ]--+--+-->[ KW_DISTINCT ]--+--+-->[ 
expression ]--+--+-->[ RPAREN ]--+
                                     |  |                    |  |               
    |  |
                                     |  +----------->--------+  +----[ COMMA 
]<-----+  |
                                     |                                          
       |
                                     
+------------------------->-----------------------+

{noformat}

Consider the *C* approach closely: The production that matches a {{COUNT}} 
invocation can be directly matched via the top branch using {{KW_COUNT}} token, 
or it could follow the branch below where {{functionName}} could match 
{{COUNT}}. On the semantic analyzer side, it makes the matching logic more 
complex and less intuitive since now the {{COUNT}} can be invoked via two 
branch conditions. For example - there would be one invocation that would 
directly delegate to the {{COUNT}} aggregate function, whereas another that 
will use the current resolver mechanism to invoke it.

Instead, the approach *B* keeps the grammar consistent with the regular 
function invocation. It does not favor any one function over the other and 
simply establishes matching rules for function production. That way, the call 
is then delegated to the semantic analyzer which in turn matches the 
appropriate handling function based on the name and parameter type using the 
generic resolver mechanism without regard to what function is being invoked. 

The changes supplied in the current patch also allow individual function 
handlers to decide if they would like to support the {{functionName(STAR)}} 
syntax. Since you feel strongly about not supporting this syntax by default for 
any function, I can perhaps modify the {{AbstractGenericUDAFResolver}} class to 
raise an exception if invoked with this syntax. That way, only the functions 
that choose to overwrite that behavior will be able to support it. Also as you 
can see in the syntax diagram for (B), there is no production that will match 
things like {{functionName(DISTINCT STAR)}} or {{functionName(STAR, EXPR)}} etc.

That said, I am open to going either way. I just wanted to put my point across 
so that you could consider it and make an informed decision. Let me know what 
you think.



> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, 
> HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

Reply via email to