[
https://issues.apache.org/jira/browse/HIVE-29618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Bereznyakov updated HIVE-29618:
------------------------------------------
Description:
h2. Description
The vectorized {{LIKE}} operator's regex-based fallback ({{ComplexChecker}},
inner class of {{AbstractFilterStringColLikeStringScalar}}) compiles its
pattern via {{Pattern.compile(pattern)}} without
the {{Pattern.DOTALL}} flag. By Java {{Pattern}} default, {{.}} does not
match {{\n}} (and {{.*?}} cannot consume newlines). Because the {{COMPLEX}}
path translates the LIKE wildcard {{_}} to {{.}}
and {{%}} to {{.*?}}, and anchors the whole regex with {{^...$}}, any
multi-line input that should match under SQL semantics is silently rejected.
The non-vectorized counterpart {{UDFLike.evaluate}}
({{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}}) compiles the
same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
Pattern.DOTALL)}} and returns the correct answer. The asymmetry was
introduced by HIVE-22008 ("LIKE Operator should match multi-line input",
2019-08-01), which added {{Pattern.DOTALL}} to
{{UDFLike}} only and did not update the parallel vectorized implementation.
The vectorized class itself was created earlier by HIVE-4642 (2013-10-03).
The regression test added by HIVE-22008 to
{{ql/src/test/queries/clientpositive/udf_like.q}} uses the pattern
{{%withdraw%cash}}. The non-vectorized classifier
{{UDFLike.parseSimplePattern}} routes a
mid-{{%}} to {{COMPLEX}}, exercising and verifying the {{Pattern.DOTALL}}
fix. The vectorized classifier
{{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} routes the same
mid-{{%}} pattern
to {{CHAINED}}, which uses byte-substring checkers and never reaches
{{ComplexChecker}}. The vectorized {{ComplexChecker}} path is reachable only
via an unescaped {{_}} in the pattern, which the test
pattern does not contain, so the vectorized regression went unobserved.
Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{pom.xml}} declares
version {{4.3.0-SNAPSHOT}}).
h2. When does the bug fire?
All of the following co-occurring conditions are required:
* Vectorized execution is on (default).
* The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard).
This is the only path that reaches {{ComplexChecker}}; patterns without {{_}}
route to non-regex {{NoneChecker}} /
{{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / {{ChainedChecker}},
which use byte-substring search and handle newlines correctly.
* The input contains {{\n}} (or another Java line terminator that {{.}} does
not match by default) in a position the produced regex must consume to satisfy
its {{^...$}} anchors.
Removing any one condition masks the bug.
h2. Mechanism
The classifier {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}}
returns {{COMPLEX}} as soon as it encounters an unescaped {{_}}. The
{{COMPLEX}} enum constant's {{format}} method builds the
regex:
{code:java}
"^" + UDFLike.likePatternToRegExp(pattern) + "$"
{code}
{{UDFLike.likePatternToRegExp}} translates {{_}} to {{.}}, {{%}} to {{.*?}},
and other characters to {{Pattern.quote}}-wrapped literals. For pattern {{a_b}}
the produced regex is {{^a.b$}}; for
{{%first_second%}} it is {{^.*?first.second.*?$}} (literal chars are
{{Pattern.quote}}-wrapped, elided here for readability).
{{AbstractFilterStringColLikeStringScalar.ComplexChecker}} then compiles the
regex at
{{ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/AbstractFilterStringColLikeStringScalar.java:422}}:
{code:java}
compiledPattern = Pattern.compile(pattern);
{code}
The non-vectorized counterpart at
{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
{code:java}
p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
{code}
Without {{Pattern.DOTALL}}, {{.}} does not match {{\n}}. The {{^...$}}
anchors require the regex to consume the entire input, but neither {{.}} (from
{{_}}) nor {{.*?}} (from {{%}}) can cross a
newline, so multi-line inputs are silently rejected.
h2. Workaround
{code:sql}
SET hive.vectorized.execution.enabled=false;
{code}
If the {{_}} in the LIKE pattern is intended as a literal underscore (not a
wildcard), escaping it ({{\_}}) routes the pattern through a non-regex
byte-substring checker and avoids the bug without
disabling vectorization.
h2. Reproduction
{code:sql}
CREATE TEMPORARY TABLE t (q STRING) STORED AS ORC;
INSERT INTO t SELECT 'first\nsecond' UNION ALL SELECT 'first_second\nthird';
SELECT count(*) FROM t WHERE q LIKE '%first_second%';
{code}
*Expected:* {{2}}. Under SQL semantics (established by HIVE-22008), both rows
match {{LIKE '%first_second%'}}:
* Row 1 {{first\nsecond}} matches because the {{_}} wildcard accepts {{\n}}
as a single character.
* Row 2 {{first_second\nthird}} matches because the literal substring
{{first_second}} is present; the trailing {{\nthird}} lies outside the matched
run and is consumed by the trailing {{%}}.
*Actual:*
{noformat}
0
{noformat}
Both rows are silently dropped, by two distinct manifestations of the same
root cause:
* In row 1, the {{.}} regex node (translated from {{_}}) cannot match {{\n}}
without {{Pattern.DOTALL}}.
* In row 2, the {{.*?}} regex node (translated from the trailing {{%}})
cannot cross {{\n}} without {{Pattern.DOTALL}}, so the anchored {{$}} is
unreachable.
The same query with {{SET hive.vectorized.execution.enabled=false}} returns
{{2}}.
was:
h2. Description
The vectorized {{LIKE}} operator's {{ComplexChecker}} compiles its regex via
{{Pattern.compile(pattern)}} without the {{Pattern.DOTALL}} flag. Patterns
containing an unescaped {{_}} wildcard route
through this path. By Java {{Pattern}} default, {{.}} does not match
{{{}\n{}}}, so the anchored regex {{^.{*}?<literal>.<literal>.{*}?$}} that the
{{COMPLEX}} branch produces cannot consume newlines, and
multi-line inputs containing the literal substring are silently rejected. The
bug is silent.
The non-vectorized counterpart {{UDFLike.evaluate}}
({{{}ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194{}}}) compiles
the same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
Pattern.DOTALL)}} and returns the correct answer. The asymmetry was introduced
by HIVE-22008 ("LIKE Operator should match multi-line input", 2019-08-01),
which added {{Pattern.DOTALL}} to
{{UDFLike}} only and did not update the parallel implementation in
{{{}AbstractFilterStringColLikeStringScalar.ComplexChecker{}}}, created earlier
by HIVE-4642 (2013-10-03). The regression test added by
HIVE-22008 ({{{}ql/src/test/queries/clientpositive/udf_like.q{}}}) does not
enable vectorized execution, so the vectorized {{ComplexChecker}} path was
never exercised by it.
Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{{}pom.xml{}}}
declares version {{{}4.3.0-SNAPSHOT{}}}).
h2. When does the bug fire?
All of the following co-occurring conditions are required:
* Vectorized execution is on (default).
* The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard).
This is the only path that reaches {{{}ComplexChecker{}}}; patterns without
{{_}} route to non-regex {{NoneChecker}} /
{{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / {{{}ChainedChecker{}}},
which are byte-substring searches and handle newlines correctly.
* The input string contains {{\n}} (or another line terminator that {{.}} does
not match by default) outside the literal portion of the pattern; equivalently,
the {{.*?}} segments of the produced
regex would need to consume a newline to satisfy {{{}^...${}}}.
Removing any one condition masks the bug.
h2. Mechanism
The classifier
{{AbstractFilterStringColLikeStringScalar.UDFLikePattern.matcher}} returns
{{COMPLEX}} as soon as it sees an unescaped {{{}_{}}}. The {{COMPLEX}} branch
builds its regex as:
{code:java}
"^" + UDFLike.likePatternToRegExp(pattern) + "$"
{code}
{{UDFLike.likePatternToRegExp}} translates {{_}} to {{{}.{}}}, {{%}} to
{{{}.*?{}}}, and other characters to {{{}Pattern.quote{}}}-wrapped literals.
For pattern {{%information_schema%}} the produced regex
(literals elided) is {{{}^.{*}?information.schema.{*}?${}}}. {{ComplexChecker}}
then compiles it at
{code:java}
compiledPattern = Pattern.compile(pattern);
{code}
The non-vectorized counterpart at
{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
{code:java}
p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
{code}
Without {{{}Pattern.DOTALL{}}}, {{.}} (and therefore {{{}.{*}?{*}{}}}{*}) does
not match {{{}\n{}}}. The {{^...$}} anchors require the regex to consume the
entire input, but {{.}}{*}{{{}?{}}} cannot cross newlines, so any
multi-line input containing the literal substring is silently rejected.
h2. Workaround
Escape the underscore so the pattern no longer reaches {{ComplexChecker}} (it
routes through one of the byte-substring checkers instead):
{code:sql}
SELECT q LIKE '%information\_schema%' FROM t;
{code}
Or disable vectorized execution for the query:
{code:sql}
SET hive.vectorized.execution.enabled=false;
{code}
h2. Reproduction
{code:sql}
{code}
{color:#910091}TBD{color}
{code:sql}
{code}
*Expected:* all four rows return {{{}match{}}}. Each row's {{q}} contains the
literal substring {{{}information_schema{}}}, and {{%information_schema%}}
matches that substring under SQL semantics (the
literal {{_}} satisfies the {{_}} single-char wildcard).
*Actual:*
{noformat}
1 match
2 NO_MATCH
3 NO_MATCH
4 NO_MATCH
{noformat}
Only row 1 (single-line) is matched. Rows 2, 3, and 4 each contain {{\n}}
outside the {{information_schema}} run and are dropped. Running the same query
with {{SET
hive.vectorized.execution.enabled=false}} returns all four rows.
> Vectorized LIKE Operator should match multi-line input (HIVE-22008 followup)
> ----------------------------------------------------------------------------
>
> Key: HIVE-29618
> URL: https://issues.apache.org/jira/browse/HIVE-29618
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
>
> h2. Description
> The vectorized {{LIKE}} operator's regex-based fallback
> ({{ComplexChecker}}, inner class of
> {{AbstractFilterStringColLikeStringScalar}}) compiles its pattern via
> {{Pattern.compile(pattern)}} without
> the {{Pattern.DOTALL}} flag. By Java {{Pattern}} default, {{.}} does not
> match {{\n}} (and {{.*?}} cannot consume newlines). Because the {{COMPLEX}}
> path translates the LIKE wildcard {{_}} to {{.}}
> and {{%}} to {{.*?}}, and anchors the whole regex with {{^...$}}, any
> multi-line input that should match under SQL semantics is silently rejected.
> The non-vectorized counterpart {{UDFLike.evaluate}}
> ({{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}}) compiles the
> same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
> Pattern.DOTALL)}} and returns the correct answer. The asymmetry was
> introduced by HIVE-22008 ("LIKE Operator should match multi-line input",
> 2019-08-01), which added {{Pattern.DOTALL}} to
> {{UDFLike}} only and did not update the parallel vectorized implementation.
> The vectorized class itself was created earlier by HIVE-4642 (2013-10-03).
> The regression test added by HIVE-22008 to
> {{ql/src/test/queries/clientpositive/udf_like.q}} uses the pattern
> {{%withdraw%cash}}. The non-vectorized classifier
> {{UDFLike.parseSimplePattern}} routes a
> mid-{{%}} to {{COMPLEX}}, exercising and verifying the {{Pattern.DOTALL}}
> fix. The vectorized classifier
> {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} routes the same
> mid-{{%}} pattern
> to {{CHAINED}}, which uses byte-substring checkers and never reaches
> {{ComplexChecker}}. The vectorized {{ComplexChecker}} path is reachable only
> via an unescaped {{_}} in the pattern, which the test
> pattern does not contain, so the vectorized regression went unobserved.
> Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{pom.xml}}
> declares version {{4.3.0-SNAPSHOT}}).
> h2. When does the bug fire?
> All of the following co-occurring conditions are required:
> * Vectorized execution is on (default).
> * The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard).
> This is the only path that reaches {{ComplexChecker}}; patterns without {{_}}
> route to non-regex {{NoneChecker}} /
> {{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / {{ChainedChecker}},
> which use byte-substring search and handle newlines correctly.
> * The input contains {{\n}} (or another Java line terminator that {{.}}
> does not match by default) in a position the produced regex must consume to
> satisfy its {{^...$}} anchors.
> Removing any one condition masks the bug.
> h2. Mechanism
> The classifier {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}}
> returns {{COMPLEX}} as soon as it encounters an unescaped {{_}}. The
> {{COMPLEX}} enum constant's {{format}} method builds the
> regex:
> {code:java}
> "^" + UDFLike.likePatternToRegExp(pattern) + "$"
> {code}
> {{UDFLike.likePatternToRegExp}} translates {{_}} to {{.}}, {{%}} to
> {{.*?}}, and other characters to {{Pattern.quote}}-wrapped literals. For
> pattern {{a_b}} the produced regex is {{^a.b$}}; for
> {{%first_second%}} it is {{^.*?first.second.*?$}} (literal chars are
> {{Pattern.quote}}-wrapped, elided here for readability).
> {{AbstractFilterStringColLikeStringScalar.ComplexChecker}} then compiles
> the regex at
>
> {{ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/AbstractFilterStringColLikeStringScalar.java:422}}:
> {code:java}
> compiledPattern = Pattern.compile(pattern);
> {code}
> The non-vectorized counterpart at
> {{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
> {code:java}
> p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
> {code}
> Without {{Pattern.DOTALL}}, {{.}} does not match {{\n}}. The {{^...$}}
> anchors require the regex to consume the entire input, but neither {{.}}
> (from {{_}}) nor {{.*?}} (from {{%}}) can cross a
> newline, so multi-line inputs are silently rejected.
> h2. Workaround
> {code:sql}
> SET hive.vectorized.execution.enabled=false;
> {code}
> If the {{_}} in the LIKE pattern is intended as a literal underscore (not a
> wildcard), escaping it ({{\_}}) routes the pattern through a non-regex
> byte-substring checker and avoids the bug without
> disabling vectorization.
> h2. Reproduction
> {code:sql}
> CREATE TEMPORARY TABLE t (q STRING) STORED AS ORC;
> INSERT INTO t SELECT 'first\nsecond' UNION ALL SELECT 'first_second\nthird';
> SELECT count(*) FROM t WHERE q LIKE '%first_second%';
> {code}
> *Expected:* {{2}}. Under SQL semantics (established by HIVE-22008), both
> rows match {{LIKE '%first_second%'}}:
> * Row 1 {{first\nsecond}} matches because the {{_}} wildcard accepts {{\n}}
> as a single character.
> * Row 2 {{first_second\nthird}} matches because the literal substring
> {{first_second}} is present; the trailing {{\nthird}} lies outside the
> matched run and is consumed by the trailing {{%}}.
> *Actual:*
> {noformat}
> 0
> {noformat}
> Both rows are silently dropped, by two distinct manifestations of the same
> root cause:
> * In row 1, the {{.}} regex node (translated from {{_}}) cannot match
> {{\n}} without {{Pattern.DOTALL}}.
> * In row 2, the {{.*?}} regex node (translated from the trailing {{%}})
> cannot cross {{\n}} without {{Pattern.DOTALL}}, so the anchored {{$}} is
> unreachable.
> The same query with {{SET hive.vectorized.execution.enabled=false}} returns
> {{2}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)