Remove REGEX Column Specification

David Mollitor Mon, 13 Apr 2020 06:57:04 -0700

Hello Gang,

I've been tracking a lot of issues recently regarding qualified table
names, table names using back ticks, and other similar circumstances.

I've looked into trying to address some of these and noted that these issues
go way back and go all the way down to the core of Hive.

To start with, I wanted to use the ANTLR grammar to address some of these
issues and to standardize behavior across all queries. For example, there
is currently a patch that disallows table names from having a 'dot' in the
name. I'm not 100% sure it applies to all queries, so I wanted to codify
this restriction in the parser grammar. So it got me looking at the
grammar.

In parallel, I also tried to build a supplemental parser in Java for
parsing table names (HIVE-23150) and I was hitting some weird, and
confusing, edge cases bubbling up from the parser. I eventually traced it
back to the fact that there are a lot of weird rules around table names in
the grammar including something called "REGEX Column Specification."

This feature is problematic as it blindly labels most table names as being
a regex. It really should only apply to column names, but the grammar
defines a table name as also possibly being a regex. There is a lot of
ambiguity because a table named "a" could be a literal value or a legal
regex. When a table name is defined as a regex, a different code path is
taken from when a table name is considered to be a literal value. Where I
first saw this issue was in a qtest where a table name `s/c` was producing
a different result than a table named `s+c`.
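
To make the ambiguity concrete, here is a minimal Python sketch. It is not
Hive code: the helper names are hypothetical, the "looks like a regex" rule is
an assumed heuristic rather than the actual grammar, and Python's `re` stands
in for Java's regex engine (the two agree on these simple patterns). It shows
one plausible way `s/c` and `s+c` could diverge once a name is read as a
pattern: `s/c` still matches itself, while `s+c` matches "sc" or "ssc" but not
the literal name "s+c".

```python
import re


def looks_like_regex(name: str) -> bool:
    """Assumed heuristic: any character outside [A-Za-z0-9_] flags a regex."""
    return re.fullmatch(r"\w+", name, flags=re.ASCII) is None


def matches_itself(name: str) -> bool:
    """True if `name`, read as a regex, fully matches `name` read literally."""
    return re.fullmatch(name, name) is not None


for table_name in ["a", "s/c", "s+c"]:
    print(
        table_name,
        "regex-like:", looks_like_regex(table_name),
        "matches itself:", matches_itself(table_name),
    )
```

Under these assumptions, both `s/c` and `s+c` are flagged as regex-like, but
only `s+c` fails to match its own literal spelling, so the two names behave
differently even though both are legal back-ticked identifiers.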

This regex feature is not something I've seen in MySQL or Postgres. In
MySQL, a table name surrounded with back ticks can contain just about any
UTF-8 character, so it's not really feasible to tell, without some kind of
SQL hint, whether a given table name is a regex or a literal value.

This feature adds a lot of ambiguity and complexity, is not supported by
other major RDBMSs, and adds only very minor benefit. I also hope to
move Hive in a direction of fully supporting UTF-8.

I have put a patch up to remove it:
https://issues.apache.org/jira/browse/HIVE-23183

References:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification

https://dev.mysql.com/doc/refman/8.0/en/identifiers.html

Thanks,
David
