[ 
https://issues.apache.org/jira/browse/HIVE-29618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29618:
------------------------------------------
    Description: 
h2. Description

The vectorized {{LIKE}} operator's regex-based fallback 
({{{}ComplexChecker{}}}, inner class of 
{{{}AbstractFilterStringColLikeStringScalar{}}}) compiles its pattern via 
{{Pattern.compile(pattern)}} without
the {{Pattern.DOTALL}} flag. By Java {{Pattern}} default, {{.}} does not match 
{{\n}} (and {{.*?}} cannot consume newlines). Because the {{COMPLEX}} path 
translates the LIKE wildcard {{_}} to {{.}}
and {{%}} to {{{}.*?{}}}, and anchors the whole regex with {{{}^...${}}}, any 
multi-line input that should match under SQL semantics is silently rejected.

The non-vectorized counterpart {{UDFLike.evaluate}} 
({{{}ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194{}}}) compiles 
the same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
Pattern.DOTALL)}} and returns the correct answer. The asymmetry was introduced 
by HIVE-22008 ("LIKE Operator should match multi-line input", 2019-08-01), 
which added {{Pattern.DOTALL}} to
{{UDFLike}} only and did not update the parallel vectorized implementation. The 
vectorized class itself was created earlier by HIVE-4642 (2013-10-03).

The regression test added by HIVE-22008 to 
{{ql/src/test/queries/clientpositive/udf_like.q}} uses the pattern 
{{{}%withdraw%cash{}}}. The non-vectorized classifier 
{{UDFLike.parseSimplePattern}} routes a
mid-{{{}%{}}} to {{{}COMPLEX{}}}, exercising and verifying the 
{{Pattern.DOTALL}} fix. The vectorized classifier 
{{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} routes the same 
mid-{{{}%{}}} pattern
to {{{}CHAINED{}}}, which uses byte-substring checkers and never reaches 
{{{}ComplexChecker{}}}. The vectorized {{ComplexChecker}} path is reachable 
only via an unescaped {{_}} in the pattern, which the test
pattern does not contain, so the vectorized regression went unobserved.

Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{{}pom.xml{}}} 
declares version {{{}4.3.0-SNAPSHOT{}}}).
h2. When does the bug fire?

All of the following co-occurring conditions are required:
 * Vectorized execution is on (default).
 * The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard). 
This is the only path that reaches {{{}ComplexChecker{}}}; patterns without 
{{_}} route to non-regex {{NoneChecker}} /
{{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / {{{}ChainedChecker{}}}, 
which use byte-substring search and handle newlines correctly.
 * The input contains {{\n}} (or another Java line terminator that {{.}} does 
not match by default) in a position the produced regex must consume to satisfy 
its {{^...$}} anchors.

Removing any one condition masks the bug.
h2. Mechanism

The classifier {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} 
returns {{COMPLEX}} as soon as it encounters an unescaped {{{}_{}}}. The 
{{COMPLEX}} enum constant's {{format}} method builds the
regex:
{code:java}
  "^" + UDFLike.likePatternToRegExp(pattern) + "$"{code}
{{UDFLike.likePatternToRegExp}} translates {{_}} to {{{}.{}}}, {{%}} to 
{{{}.*?{}}}, and other characters to {{{}Pattern.quote{}}}-wrapped literals. 
For pattern {{a_b}} the produced regex is {{{}^a.b${}}}; for
{{%first_second%}} it is {{^.{*}?first.second.{*}?$}} (literal chars are 
{{{}Pattern.quote{}}}-wrapped, elided here for readability).

{{AbstractFilterStringColLikeStringScalar.ComplexChecker}} then compiles the 
regex at
{{{}ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/AbstractFilterStringColLikeStringScalar.java:422{}}}:
{code:java}
  compiledPattern = Pattern.compile(pattern);{code}
The non-vectorized counterpart at 
{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
{code:java}
  p = Pattern.compile(likePatternToRegExp(strLikePattern), 
Pattern.DOTALL);{code}
Without {{{}Pattern.DOTALL{}}}, {{.}} does not match {{{}\n{}}}. The {{^...$}} 
anchors require the regex to consume the entire input, but neither {{.}} (from 
{{{}_{}}}) nor {{.*?}} (from {{{}%{}}}) can cross a
newline, so multi-line inputs are silently rejected.
h2. Workaround
{code:sql}
  SET hive.vectorized.execution.enabled=false;{code}
If the {{_}} in the LIKE pattern is intended as a literal underscore (not a 
wildcard), escaping it ({{{}_{}}}) routes the pattern through a non-regex 
byte-substring checker and avoids the bug without
disabling vectorization.
h2. Reproduction
{code:sql}
  CREATE TEMPORARY TABLE t (q STRING) STORED AS ORC;
  INSERT INTO t SELECT 'first\nsecond' UNION ALL SELECT 'first_second\nthird';
  SELECT count(*) FROM t WHERE q LIKE '%first_second%';{code}
*Expected:* {{{}2{}}}. Under SQL semantics (established by HIVE-22008), both 
rows match {{{}LIKE '%first_second%'{}}}:
 * Row 1 {{first\nsecond}} matches because the {{_}} wildcard accepts {{\n}} as 
a single character.
 * Row 2 {{first_second\nthird}} matches because the literal substring 
{{first_second}} is present; the trailing {{\nthird}} lies outside the matched 
run and is consumed by the trailing {{{}%{}}}.

*Actual:*
{noformat}
  0{noformat}
Both rows are silently dropped, by two distinct manifestations of the same root 
cause:
 * In row 1, the {{.}} regex node (translated from {{{}_{}}}) cannot match 
{{\n}} without {{{}Pattern.DOTALL{}}}.
 * In row 2, the {{.*?}} regex node (translated from the trailing {{{}%{}}}) 
cannot cross {{\n}} without {{{}Pattern.DOTALL{}}}, so the anchored {{$}} is 
unreachable.

The same query with {{SET hive.vectorized.execution.enabled=false}} returns 
{{{}2{}}}.

  was:
 h2. Description

  The vectorized {{LIKE}} operator's regex-based fallback ({{ComplexChecker}}, 
inner class of {{AbstractFilterStringColLikeStringScalar}}) compiles its 
pattern via {{Pattern.compile(pattern)}} without
  the {{Pattern.DOTALL}} flag. By Java {{Pattern}} default, {{.}} does not 
match {{\n}} (and {{.*?}} cannot consume newlines). Because the {{COMPLEX}} 
path translates the LIKE wildcard {{_}} to {{.}}
  and {{%}} to {{.*?}}, and anchors the whole regex with {{^...$}}, any 
multi-line input that should match under SQL semantics is silently rejected.

  The non-vectorized counterpart {{UDFLike.evaluate}} 
({{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}}) compiles the 
same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
   Pattern.DOTALL)}} and returns the correct answer. The asymmetry was 
introduced by HIVE-22008 ("LIKE Operator should match multi-line input", 
2019-08-01), which added {{Pattern.DOTALL}} to
  {{UDFLike}} only and did not update the parallel vectorized implementation. 
The vectorized class itself was created earlier by HIVE-4642 (2013-10-03).

  The regression test added by HIVE-22008 to 
{{ql/src/test/queries/clientpositive/udf_like.q}} uses the pattern 
{{%withdraw%cash}}. The non-vectorized classifier 
{{UDFLike.parseSimplePattern}} routes a
   mid-{{%}} to {{COMPLEX}}, exercising and verifying the {{Pattern.DOTALL}} 
fix. The vectorized classifier 
{{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} routes the same 
mid-{{%}} pattern
  to {{CHAINED}}, which uses byte-substring checkers and never reaches 
{{ComplexChecker}}. The vectorized {{ComplexChecker}} path is reachable only 
via an unescaped {{_}} in the pattern, which the test
   pattern does not contain, so the vectorized regression went unobserved.

  Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{pom.xml}} declares 
version {{4.3.0-SNAPSHOT}}).

  h2. When does the bug fire?

  All of the following co-occurring conditions are required:

  * Vectorized execution is on (default).
  * The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard). 
This is the only path that reaches {{ComplexChecker}}; patterns without {{_}} 
route to non-regex {{NoneChecker}} /
  {{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / {{ChainedChecker}}, 
which use byte-substring search and handle newlines correctly.
  * The input contains {{\n}} (or another Java line terminator that {{.}} does 
not match by default) in a position the produced regex must consume to satisfy 
its {{^...$}} anchors.

  Removing any one condition masks the bug.

  h2. Mechanism

  The classifier {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} 
returns {{COMPLEX}} as soon as it encounters an unescaped {{_}}. The 
{{COMPLEX}} enum constant's {{format}} method builds the
   regex:

  {code:java}
  "^" + UDFLike.likePatternToRegExp(pattern) + "$"
  {code}

  {{UDFLike.likePatternToRegExp}} translates {{_}} to {{.}}, {{%}} to {{.*?}}, 
and other characters to {{Pattern.quote}}-wrapped literals. For pattern {{a_b}} 
the produced regex is {{^a.b$}}; for
  {{%first_second%}} it is {{^.*?first.second.*?$}} (literal chars are 
{{Pattern.quote}}-wrapped, elided here for readability).

  {{AbstractFilterStringColLikeStringScalar.ComplexChecker}} then compiles the 
regex at
  
{{ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/AbstractFilterStringColLikeStringScalar.java:422}}:

  {code:java}
  compiledPattern = Pattern.compile(pattern);
  {code}

  The non-vectorized counterpart at 
{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:

  {code:java}
  p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
  {code}

  Without {{Pattern.DOTALL}}, {{.}} does not match {{\n}}. The {{^...$}} 
anchors require the regex to consume the entire input, but neither {{.}} (from 
{{_}}) nor {{.*?}} (from {{%}}) can cross a
  newline, so multi-line inputs are silently rejected.

  h2. Workaround

  {code:sql}
  SET hive.vectorized.execution.enabled=false;
  {code}

  If the {{_}} in the LIKE pattern is intended as a literal underscore (not a 
wildcard), escaping it ({{\_}}) routes the pattern through a non-regex 
byte-substring checker and avoids the bug without
  disabling vectorization.

  h2. Reproduction

  {code:sql}
  CREATE TEMPORARY TABLE t (q STRING) STORED AS ORC;
  INSERT INTO t SELECT 'first\nsecond' UNION ALL SELECT 'first_second\nthird';
  SELECT count(*) FROM t WHERE q LIKE '%first_second%';
  {code}

  *Expected:* {{2}}. Under SQL semantics (established by HIVE-22008), both rows 
match {{LIKE '%first_second%'}}:

  * Row 1 {{first\nsecond}} matches because the {{_}} wildcard accepts {{\n}} 
as a single character.
  * Row 2 {{first_second\nthird}} matches because the literal substring 
{{first_second}} is present; the trailing {{\nthird}} lies outside the matched 
run and is consumed by the trailing {{%}}.

  *Actual:*

  {noformat}
  0
  {noformat}

  Both rows are silently dropped, by two distinct manifestations of the same 
root cause:

  * In row 1, the {{.}} regex node (translated from {{_}}) cannot match {{\n}} 
without {{Pattern.DOTALL}}.
  * In row 2, the {{.*?}} regex node (translated from the trailing {{%}}) 
cannot cross {{\n}} without {{Pattern.DOTALL}}, so the anchored {{$}} is 
unreachable.

  The same query with {{SET hive.vectorized.execution.enabled=false}} returns 
{{2}}.


> Vectorized LIKE Operator should match multi-line input (HIVE-22008 followup)
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-29618
>                 URL: https://issues.apache.org/jira/browse/HIVE-29618
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>
> h2. Description
> The vectorized {{LIKE}} operator's regex-based fallback 
> ({{{}ComplexChecker{}}}, inner class of 
> {{{}AbstractFilterStringColLikeStringScalar{}}}) compiles its pattern via 
> {{Pattern.compile(pattern)}} without
> the {{Pattern.DOTALL}} flag. By Java {{Pattern}} default, {{.}} does not 
> match {{\n}} (and {{.*?}} cannot consume newlines). Because the {{COMPLEX}} 
> path translates the LIKE wildcard {{_}} to {{.}}
> and {{%}} to {{{}.*?{}}}, and anchors the whole regex with {{{}^...${}}}, any 
> multi-line input that should match under SQL semantics is silently rejected.
> The non-vectorized counterpart {{UDFLike.evaluate}} 
> ({{{}ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194{}}}) compiles 
> the same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
> Pattern.DOTALL)}} and returns the correct answer. The asymmetry was 
> introduced by HIVE-22008 ("LIKE Operator should match multi-line input", 
> 2019-08-01), which added {{Pattern.DOTALL}} to
> {{UDFLike}} only and did not update the parallel vectorized implementation. 
> The vectorized class itself was created earlier by HIVE-4642 (2013-10-03).
> The regression test added by HIVE-22008 to 
> {{ql/src/test/queries/clientpositive/udf_like.q}} uses the pattern 
> {{{}%withdraw%cash{}}}. The non-vectorized classifier 
> {{UDFLike.parseSimplePattern}} routes a
> mid-{{{}%{}}} to {{{}COMPLEX{}}}, exercising and verifying the 
> {{Pattern.DOTALL}} fix. The vectorized classifier 
> {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} routes the same 
> mid-{{{}%{}}} pattern
> to {{{}CHAINED{}}}, which uses byte-substring checkers and never reaches 
> {{{}ComplexChecker{}}}. The vectorized {{ComplexChecker}} path is reachable 
> only via an unescaped {{_}} in the pattern, which the test
> pattern does not contain, so the vectorized regression went unobserved.
> Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{{}pom.xml{}}} 
> declares version {{{}4.3.0-SNAPSHOT{}}}).
> h2. When does the bug fire?
> All of the following co-occurring conditions are required:
>  * Vectorized execution is on (default).
>  * The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard). 
> This is the only path that reaches {{{}ComplexChecker{}}}; patterns without 
> {{_}} route to non-regex {{NoneChecker}} /
> {{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / 
> {{{}ChainedChecker{}}}, which use byte-substring search and handle newlines 
> correctly.
>  * The input contains {{\n}} (or another Java line terminator that {{.}} does 
> not match by default) in a position the produced regex must consume to 
> satisfy its {{^...$}} anchors.
> Removing any one condition masks the bug.
> h2. Mechanism
> The classifier {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} 
> returns {{COMPLEX}} as soon as it encounters an unescaped {{{}_{}}}. The 
> {{COMPLEX}} enum constant's {{format}} method builds the
> regex:
> {code:java}
>   "^" + UDFLike.likePatternToRegExp(pattern) + "$"{code}
> {{UDFLike.likePatternToRegExp}} translates {{_}} to {{{}.{}}}, {{%}} to 
> {{{}.*?{}}}, and other characters to {{{}Pattern.quote{}}}-wrapped literals. 
> For pattern {{a_b}} the produced regex is {{{}^a.b${}}}; for
> {{%first_second%}} it is {{^.{*}?first.second.{*}?$}} (literal chars are 
> {{{}Pattern.quote{}}}-wrapped, elided here for readability).
> {{AbstractFilterStringColLikeStringScalar.ComplexChecker}} then compiles the 
> regex at
> {{{}ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/AbstractFilterStringColLikeStringScalar.java:422{}}}:
> {code:java}
>   compiledPattern = Pattern.compile(pattern);{code}
> The non-vectorized counterpart at 
> {{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
> {code:java}
>   p = Pattern.compile(likePatternToRegExp(strLikePattern), 
> Pattern.DOTALL);{code}
> Without {{{}Pattern.DOTALL{}}}, {{.}} does not match {{{}\n{}}}. The 
> {{^...$}} anchors require the regex to consume the entire input, but neither 
> {{.}} (from {{{}_{}}}) nor {{.*?}} (from {{{}%{}}}) can cross a
> newline, so multi-line inputs are silently rejected.
> h2. Workaround
> {code:sql}
>   SET hive.vectorized.execution.enabled=false;{code}
> If the {{_}} in the LIKE pattern is intended as a literal underscore (not a 
> wildcard), escaping it ({{{}_{}}}) routes the pattern through a non-regex 
> byte-substring checker and avoids the bug without
> disabling vectorization.
> h2. Reproduction
> {code:sql}
>   CREATE TEMPORARY TABLE t (q STRING) STORED AS ORC;
>   INSERT INTO t SELECT 'first\nsecond' UNION ALL SELECT 'first_second\nthird';
>   SELECT count(*) FROM t WHERE q LIKE '%first_second%';{code}
> *Expected:* {{{}2{}}}. Under SQL semantics (established by HIVE-22008), both 
> rows match {{{}LIKE '%first_second%'{}}}:
>  * Row 1 {{first\nsecond}} matches because the {{_}} wildcard accepts {{\n}} 
> as a single character.
>  * Row 2 {{first_second\nthird}} matches because the literal substring 
> {{first_second}} is present; the trailing {{\nthird}} lies outside the 
> matched run and is consumed by the trailing {{{}%{}}}.
> *Actual:*
> {noformat}
>   0{noformat}
> Both rows are silently dropped, by two distinct manifestations of the same 
> root cause:
>  * In row 1, the {{.}} regex node (translated from {{{}_{}}}) cannot match 
> {{\n}} without {{{}Pattern.DOTALL{}}}.
>  * In row 2, the {{.*?}} regex node (translated from the trailing {{{}%{}}}) 
> cannot cross {{\n}} without {{{}Pattern.DOTALL{}}}, so the anchored {{$}} is 
> unreachable.
> The same query with {{SET hive.vectorized.execution.enabled=false}} returns 
> {{{}2{}}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to