David Mollitor created HIVE-23172:
-------------------------------------

             Summary: Quoted Backtick Columns Are Not Parsing Correctly
                 Key: HIVE-23172
                 URL: https://issues.apache.org/jira/browse/HIVE-23172
             Project: Hive
          Issue Type: Improvement
            Reporter: David Mollitor
            Assignee: David Mollitor


I recently came across some odd behavior while examining failures of 
{{special_character_in_tabnames_2.q}} while working on HIVE-23150. I was 
surprised to see it fail because I couldn't see any reason why it should: it 
runs pretty standard SQL statements just like every other test. But this test 
is just a *little bit* different from most others, and that difference brought 
this issue to light.

Turns out, the parsing of table names is pretty much wrong across the board.

The statement that caught my attention was this:
{code:sql}
DROP TABLE IF EXISTS `s/c`;
{code}
And here is the relevant grammar:
{code:none}
fragment
RegexComponent
    : 'a'..'z' | 'A'..'Z' | '0'..'9' | '_'
    | PLUS | STAR | QUESTION | MINUS | DOT
    | LPAREN | RPAREN | LSQUARE | RSQUARE | LCURLY | RCURLY
    | BITWISEXOR | BITWISEOR | DOLLAR | '!'
    ;

Identifier
    :
    (Letter | Digit) (Letter | Digit | '_')*
    | {allowQuotedId()}? QuotedIdentifier  /* though at the language level we allow all Identifiers to be QuotedIdentifiers;
                                              at the API level only columns are allowed to be of this form */
    | '`' RegexComponent+ '`'
    ;

fragment    
QuotedIdentifier 
    :
    '`'  ( '``' | ~('`') )* '`' { setText(StringUtils.replace(getText().substring(1, getText().length() -1 ), "``", "`")); }
    ;
{code}
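The action attached to {{QuotedIdentifier}} is what removes the backticks. As a rough illustration only (a Python paraphrase of the Java action above, not Hive code; the function name is made up for this sketch), it amounts to:

```python
def quoted_identifier_text(token_text):
    """Mimic the setText(...) action on the QuotedIdentifier rule."""
    # Drop the opening and closing backtick, like substring(1, length - 1).
    inner = token_text[1:-1]
    # Collapse each escaped pair of backticks into a single backtick.
    return inner.replace("``", "`")


print(quoted_identifier_text("`s/c`"))
print(quoted_identifier_text("`a``b`"))
```

Any token that goes through this rule therefore reaches the rest of Hive without its surrounding backticks.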
The mystery for me was that, for some reason, the string {{`s/c`}} was being 
stripped of its backticks. No other test I investigated showed this behavior; 
the backticks were always preserved around the table name, and the main Hive 
Java code base would see the backticks and deal with them internally. For 
HIVE-23150, I introduced some sanity checks, and they were failing because 
they expected the backticks to be present.

With the help of HIVE-23171 I finally figured it out. Pretty much every table 
name matches the {{RegexComponent}} alternative, and the backticks are carried 
forward. In {{`s/c`}}, however, the forward slash {{/}} is not allowed by 
{{RegexComponent}}, so the name matches the {{QuotedIdentifier}} rule instead, 
which trims the backticks.

I validated this by disabling {{QuotedIdentifier}}. With it disabled, 
{{`s/c`}} fails with an error but {{`sc`}} parses successfully, because 
{{`sc`}} is being matched by the {{RegexComponent}} alternative.

So, if you have {{allowQuotedId}} disabled, table names can only use the 
characters defined in the {{RegexComponent}} rule (anything else is an error), 
and the backticks are *not* stripped. If you have {{allowQuotedId}} enabled, 
the outcome depends on the name: if it contains a character not specified in 
{{RegexComponent}}, it is still accepted as a table name, via 
{{QuotedIdentifier}}, and the backticks *are* stripped; if all of its 
characters are part of {{RegexComponent}}, the backticks are *not* stripped.
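The cases above can be summarized in a small model. This is only a sketch of the behavior as observed in this report, not the ANTLR-generated lexer: the function and constant names are hypothetical, the regular expressions approximate the grammar rules, and the punctuation tokens are assumed to map to their usual single characters.

```python
import re

# Approximation of the '`' RegexComponent+ '`' alternative (assumed characters).
REGEX_COMPONENT_FORM = re.compile(r"`[A-Za-z0-9_+*?\-.()\[\]{}^|$!]+`")
# Approximation of the QuotedIdentifier rule: '`' ( '``' | ~('`') )* '`'
QUOTED_IDENTIFIER_FORM = re.compile(r"`(?:``|[^`])*`")


def observed_token_text(source, allow_quoted_id):
    """Return the identifier text the rest of Hive would see for a quoted name."""
    if REGEX_COMPONENT_FORM.fullmatch(source):
        # Matched by the RegexComponent alternative: backticks are kept.
        return source
    if allow_quoted_id and QUOTED_IDENTIFIER_FORM.fullmatch(source):
        # Matched by QuotedIdentifier: backticks stripped, `` unescaped.
        return source[1:-1].replace("``", "`")
    raise ValueError("not a valid identifier: " + source)


for name in ("`sc`", "`s/c`"):
    for flag in (True, False):
        try:
            print(name, flag, observed_token_text(name, flag))
        except ValueError as err:
            print(name, flag, "error:", err)
```

The inconsistency is visible in the return values: the same quoting syntax yields text with or without backticks depending only on which characters the name happens to contain.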



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
