[
https://issues.apache.org/jira/browse/HIVE-29618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HIVE-29618:
----------------------------------
Labels: accuracy pull-request-available (was: accuracy)
> Vectorized LIKE Operator should match multi-line input (HIVE-22008 followup)
> ----------------------------------------------------------------------------
>
> Key: HIVE-29618
> URL: https://issues.apache.org/jira/browse/HIVE-29618
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Labels: accuracy, pull-request-available
>
> h2. Description
> The vectorized {{LIKE}} operator's regex-based fallback
> ({{{}ComplexChecker{}}}, inner class of
> {{{}AbstractFilterStringColLikeStringScalar{}}}) compiles its pattern via
> {{Pattern.compile(pattern)}} without
> the {{Pattern.DOTALL}} flag. By Java {{Pattern}} default, {{.}} does not
> match {{\n}} (and {{.*?}} cannot consume newlines). Because the {{COMPLEX}}
> path translates the LIKE wildcard {{_}} to {{.}}
> and {{%}} to {{{}.*?{}}}, and anchors the whole regex with {{{}^...${}}}, any
> multi-line input that should match under SQL semantics is silently rejected.
> The non-vectorized counterpart {{UDFLike.evaluate}}
> ({{{}ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194{}}}) compiles
> the same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
> Pattern.DOTALL)}} and returns the correct answer. The asymmetry was
> introduced by HIVE-22008 ("LIKE Operator should match multi-line input",
> 2019-08-01), which added {{Pattern.DOTALL}} to
> {{UDFLike}} only and did not update the parallel vectorized implementation.
> The vectorized class itself was created earlier by HIVE-4642 (2013-10-03).
> The regression test added by HIVE-22008 to
> {{ql/src/test/queries/clientpositive/udf_like.q}} uses the pattern
> {{{}%withdraw%cash{}}}. The non-vectorized classifier
> {{UDFLike.parseSimplePattern}} routes a
> mid-{{{}%{}}} to {{{}COMPLEX{}}}, exercising and verifying the
> {{Pattern.DOTALL}} fix. The vectorized classifier
> {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}} routes the same
> mid-{{{}%{}}} pattern
> to {{{}CHAINED{}}}, which uses byte-substring checkers and never reaches
> {{{}ComplexChecker{}}}. The vectorized {{ComplexChecker}} path is reachable
> only via an unescaped {{_}} in the pattern, which the test
> pattern does not contain, so the vectorized regression went unobserved.
> Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{{}pom.xml{}}}
> declares version {{{}4.3.0-SNAPSHOT{}}}).
> h2. When does the bug fire?
> All of the following co-occurring conditions are required:
> * Vectorized execution is on (default).
> * The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard).
> This is the only path that reaches {{{}ComplexChecker{}}}; patterns without
> {{_}} route to non-regex {{NoneChecker}} /
> {{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} /
> {{{}ChainedChecker{}}}, which use byte-substring search and handle newlines
> correctly.
> * The input contains {{\n}} (or another Java line terminator that {{.}} does
> not match by default) in a position the produced regex must consume to
> satisfy its {{^...$}} anchors.
> Removing any one condition masks the bug.
> h2. Mechanism
> The classifier {{FilterStringColLikeStringScalar.UDFLikePattern.matcher}}
> returns {{COMPLEX}} as soon as it encounters an unescaped {{{}_{}}}. The
> {{COMPLEX}} enum constant's {{format}} method builds the
> regex:
> {code:java}
> "^" + UDFLike.likePatternToRegExp(pattern) + "$"{code}
> {{UDFLike.likePatternToRegExp}} translates {{_}} to {{{}.{}}}, {{%}} to
> {{{}.*?{}}}, and other characters to {{{}Pattern.quote{}}}-wrapped literals.
> For pattern {{a_b}} the produced regex is {{{}^a.b${}}}; for
> {{%first_second%}} it is {{^.{*}?first.second.{*}?$}} (literal chars are
> {{{}Pattern.quote{}}}-wrapped, elided here for readability).
> {{AbstractFilterStringColLikeStringScalar.ComplexChecker}} then compiles the
> regex at
> {{{}ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/AbstractFilterStringColLikeStringScalar.java:422{}}}:
> {code:java}
> compiledPattern = Pattern.compile(pattern);{code}
> The non-vectorized counterpart at
> {{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
> {code:java}
> p = Pattern.compile(likePatternToRegExp(strLikePattern),
> Pattern.DOTALL);{code}
> Without {{{}Pattern.DOTALL{}}}, {{.}} does not match {{{}\n{}}}. The
> {{^...$}} anchors require the regex to consume the entire input, but neither
> {{.}} (from {{{}_{}}}) nor {{.*?}} (from {{{}%{}}}) can cross a
> newline, so multi-line inputs are silently rejected.
> h2. Workaround
> {code:sql}
> SET hive.vectorized.execution.enabled=false;{code}
> If the {{_}} in the LIKE pattern is intended as a literal underscore (not a
> wildcard), escaping it ({{{}_{}}}) routes the pattern through a non-regex
> byte-substring checker and avoids the bug without
> disabling vectorization.
> h2. Reproduction
> {code:sql}
> CREATE TEMPORARY TABLE t (q STRING) STORED AS ORC;
> INSERT INTO t SELECT 'first\nsecond' UNION ALL SELECT 'first_second\nthird';
> SELECT count(*) FROM t WHERE q LIKE '%first_second%';{code}
> *Expected:* {{{}2{}}}. Under SQL semantics (established by HIVE-22008), both
> rows match {{{}LIKE '%first_second%'{}}}:
> * Row 1 {{first\nsecond}} matches because the {{_}} wildcard accepts {{\n}}
> as a single character.
> * Row 2 {{first_second\nthird}} matches because the literal substring
> {{first_second}} is present; the trailing {{\nthird}} lies outside the
> matched run and is consumed by the trailing {{{}%{}}}.
> *Actual:*
> {noformat}
> 0{noformat}
> Both rows are silently dropped, by two distinct manifestations of the same
> root cause:
> * In row 1, the {{.}} regex node (translated from {{{}_{}}}) cannot match
> {{\n}} without {{{}Pattern.DOTALL{}}}.
> * In row 2, the {{.*?}} regex node (translated from the trailing {{{}%{}}})
> cannot cross {{\n}} without {{{}Pattern.DOTALL{}}}, so the anchored {{$}} is
> unreachable.
> The same query with {{SET hive.vectorized.execution.enabled=false}} returns
> {{{}2{}}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)