[
https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884580#action_12884580
]
Arvind Prabhakar commented on HIVE-287:
---------------------------------------
Thanks for the explanation John. The SQL BNF that you pointed out is the
normative SQL specification. I do not think any SQL implementations use this
grammar though. The parallel is that of an interface and its implementation.
While the interface can be short and precise, the implementations may choose to
delegate interface methods to other implementation specific methods. Similarly,
most databases deal with their own SQL grammar that is compliant with the SQL
standard at specific levels.
More to the point in Hive - my key concern is that by modifying the grammar to
make an exception for {{COUNT}}, we will be introducing a brittle coupling
between the the parser and semantic analyzer. Right now the count aggregate
function is treated like any other function and is thus part of the general
framework. By making this change, we will be modifying it to be specifically
associated from with the grammar directives.
This is the current function definition in Hive QL grammar (*A*):
{noformat}
-->[ functionName ]-->[ LPAREN ]--+-->[ KW_DISTINCT ]--+--+--+-->[ expression
]--+--+-->[ RPAREN ]-->
| | | |
| |
+----------->--------+ | +------[ COMMA
]<---+ |
|
|
+------------->-----------+
{noformat}
The patch that I have supplied already on this Jira modifies this definition as
follows (*B*):
{noformat}
-->[ functionName ]-->[ LPAREN ]--+------------------------>[ STAR
]----------------------+-->[ RPAREN ]-->
|
|
+--+-->[ KW_DISTINCT ]--+--+--+-->[
expression ]--+--+--+
| | | |
| |
+----------->--------+ | +------[ COMMA
]<---+ |
|
|
+------------->-----------+
{noformat}
If I were to modify the grammar to make an exception for {{COUNT}} it will
likely be changed to something like this (*C*):
{noformat}
--+---------------------------------->[ KW_COUNT ]-->[ LPAREN ]-->[ STAR ]-->[
RPAREN ]----------------+-->
|
|
+-->[ functionName ]-->[ LPAREN ]--+--+-->[ KW_DISTINCT ]--+--+-->[
expression ]--+--+-->[ RPAREN ]--+
| | | |
| |
| +----------->--------+ +----[ COMMA
]<-----+ |
|
|
+------------------------->-----------------------+
{noformat}
Consider the *C* approach closely: The production that matches a {{COUNT}}
invocation can be directly matched via the top branch using {{KW_COUNT}} token,
or it could follow the branch below where {{functionName}} could match
{{COUNT}}. On the semantic analyzer side, it makes the matching logic more
complex and less intuitive since now the {{COUNT}} can be invoked via two
branch conditions. For example - there would be one invocation that would
directly delegate to the {{COUNT}} aggregate function, whereas another that
will use the current resolver mechanism to invoke it.
Instead, the approach *B* keeps the grammar consistent with the regular
function invocation. It does not favor any one function over the other and
simply establishes matching rules for function production. That way, the call
is then delegated to the semantic analyzer which in turn matches the
appropriate handling function based on the name and parameter type using the
generic resolver mechanism without regard to what function is being invoked.
The changes supplied in the current patch also allow individual function
handlers to decide if they would like to support the {{functionName(STAR)}}
syntax. Since you feel strongly about not supporting this syntax by default for
any function, I can perhaps modify the {{AbstractGenericUDAFResolver}} class to
raise an exception if invoked with this syntax. That way, only the functions
that choose to overwrite that behavior will be able to support it. Also as you
can see in the syntax diagram for (B), there is no production that will match
things like {{functionName(DISTINCT STAR)}} or {{functionName(STAR, EXPR)}} etc.
That said, I am open to going either way. I just wanted to put my point across
so that you could consider it and make an informed decision. Let me know what
you think.
> count distinct on multiple columns does not work
> ------------------------------------------------
>
> Key: HIVE-287
> URL: https://issues.apache.org/jira/browse/HIVE-287
> Project: Hadoop Hive
> Issue Type: Bug
> Components: Query Processor
> Reporter: Namit Jain
> Assignee: Arvind Prabhakar
> Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch,
> HIVE-287-4.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.