[
https://issues.apache.org/jira/browse/HIVE-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079653#comment-17079653
]
David Mollitor commented on HIVE-23149:
---------------------------------------
OK. So, as I look at this more, it's becoming clear to me that Hive needs to
fix this starting from the top, the Parser/Grammar should be better and more
clear about handling these cases.
[https://github.com/apache/hive/blob/master/parser/src/java/org/apache/hadoop/hive/ql/parse/FromClauseParser.g#L212]
{code:none}
tableName
@init { gParent.pushMsg("table name", state); }
@after { gParent.popMsg(state); }
:
db=identifier DOT tab=identifier
-> ^(TOK_TABNAME $db $tab)
|
tab=identifier
-> ^(TOK_TABNAME $tab)
;
Identifier
:
(Letter | Digit) (Letter | Digit | '_')*
| {allowQuotedId()}? QuotedIdentifier /* though at the language level we
allow all Identifiers to be QuotedIdentifiers;
at the API level only columns are
allowed to be of this form */
| '`' RegexComponent+ '`'
;
identifier
:
Identifier
| nonReserved -> Identifier[$nonReserved.start]
;
{code}
An Identifier can be ASCII or back-ticked UTF-8. I don't see the backtick here
represented here.
I believe this should be something like...
{code:none}
qualifiedIdentifier:
identifier dotIdentifier? ;
dotIdentifier:
DOT identifier ;
identifier:
-- UnquotedIdentifier |
-- BackTickQuotedIdentifier |
-- DoubleQuoteQuotedIdentifier (optional) ;
tableName:
qualifiedIdentifier
;
{code}
In this way, the code receiving the tree can check for UnquotedIdentifier |
BackTickQuotedIdentifier | DoubleQuoteQuotedIdentifier | and then it later
becomes trivial (in the Hive code base) to know if the name should be stripped
based on the identifier type.
> Consistency of Parsing Object Identifiers
> -----------------------------------------
>
> Key: HIVE-23149
> URL: https://issues.apache.org/jira/browse/HIVE-23149
> Project: Hive
> Issue Type: Improvement
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Critical
>
> There needs to be better consistency with handling of object identifiers
> (database, tables, column, view, function, etc.). I think it makes sense to
> standardize on the same rules which MySQL/MariaDB uses for their column names
> so that Hive can be more of a drop-in replacement for these.
>
> The two important things to keep in mind are:
>
> 1// Permitted characters in quoted identifiers include the full Unicode Basic
> Multilingual Plane (BMP), except U+0000
>
> 2// If any components of a multiple-part name require quoting, quote them
> individually rather than quoting the name as a whole. For example, write
> {{`my-table`.`my-column`}}, not {{`my-table.my-column`}}.
>
> [https://dev.mysql.com/doc/refman/8.0/en/identifiers.html]
> [https://dev.mysql.com/doc/refman/8.0/en/identifier-qualifiers.html]
>
> That is to say:
>
> {code:sql}
> -- Select all rows from a table named `default.mytable`
> -- (Yes, the table name itself has a period in it. This is valid)
> SELECT * FROM `default.mytable`;
>
> -- Select all rows from database `default`, table `mytable`
> SELECT * FROM `default`.`mytable`;
> {code}
>
> This plays out in a couple of ways. There may be more, but these are the
> ones I know about already:
>
> 1// Hive generates incorrect syntax: [HIVE-23128]
>
> 2// Hive throws exception if there is a period in the table name. This is an
> invalid response. Table name may have a period in them. More likely than
> not, it will throw 'table not found' exception since the user most likely
> accidentally used backticks incorrectly and meant to specify a db and a table
> separately. [HIVE-16907]
> Once we have the parsing figured out and support for backticks to enclose
> UTF-8 strings, then the backend database needs to actually support the UTF-8
> character set. It currently does not: [HIVE-1808]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)