[
https://issues.apache.org/jira/browse/HIVE-29618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HIVE-29618 started by Konstantin Bereznyakov.
-----------------------------------------------------
> Vectorized LIKE Operator should match multi-line input (HIVE-22008 followup)
> ----------------------------------------------------------------------------
>
> Key: HIVE-29618
> URL: https://issues.apache.org/jira/browse/HIVE-29618
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
>
> h2. Description
> The vectorized {{LIKE}} operator's {{ComplexChecker}} compiles its regex
> via {{Pattern.compile(pattern)}} without the {{Pattern.DOTALL}} flag.
> Patterns containing an unescaped {{_}} wildcard route
> through this path. By Java {{Pattern}} default, {{.}} does not match
> {{\n}}, so the anchored regex {{^.*?<literal>.<literal>.*?$}} that the
> {{COMPLEX}} branch produces cannot consume newlines, and
> multi-line inputs containing the literal substring are silently rejected.
> The bug is silent.
> The non-vectorized counterpart {{UDFLike.evaluate}}
> ({{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}}) compiles the
> same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
> Pattern.DOTALL)}} and returns the correct answer. The asymmetry was
> introduced by HIVE-22008 ("LIKE Operator should match multi-line input",
> 2019-08-01), which added {{Pattern.DOTALL}} to
> {{UDFLike}} only and did not update the parallel implementation in
> {{AbstractFilterStringColLikeStringScalar.ComplexChecker}}, created earlier
> by HIVE-4642 (2013-10-03). The regression test added by
> HIVE-22008 ({{ql/src/test/queries/clientpositive/udf_like.q}}) does not
> enable vectorized execution, so the vectorized {{ComplexChecker}} path was
> never exercised by it.
> Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{pom.xml}}
> declares version {{4.3.0-SNAPSHOT}}).
> h2. When does the bug fire?
> All of the following co-occurring conditions are required:
> * Vectorized execution is on (default).
> * The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard).
> This is the only path that reaches {{ComplexChecker}}; patterns without {{_}}
> route to non-regex {{NoneChecker}} /
> {{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / {{ChainedChecker}},
> which are byte-substring searches and handle newlines correctly.
> * The input string contains {{\n}} (or another line terminator that {{.}}
> does not match by default) outside the literal portion of the pattern;
> equivalently, the {{.*?}} segments of the produced
> regex would need to consume a newline to satisfy {{^...$}}.
> Removing any one condition masks the bug.
> h2. Mechanism
> The classifier
> {{AbstractFilterStringColLikeStringScalar.UDFLikePattern.matcher}} returns
> {{COMPLEX}} as soon as it sees an unescaped {{_}}. The {{COMPLEX}} branch
> builds its regex as:
> {code:java}
> "^" + UDFLike.likePatternToRegExp(pattern) + "$"
> {code}
> {{UDFLike.likePatternToRegExp}} translates {{_}} to {{.}}, {{%}} to
> {{.*?}}, and other characters to {{Pattern.quote}}-wrapped literals. For
> pattern {{%information_schema%}} the produced regex
> (literals elided) is {{^.*?information.schema.*?$}}. {{ComplexChecker}}
> then compiles it at
> {code:java}
> compiledPattern = Pattern.compile(pattern);
> {code}
> The non-vectorized counterpart at
> {{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
> {code:java}
> p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
> {code}
> Without {{Pattern.DOTALL}}, {{.}} (and therefore {{.*?}}) does not match
> {{\n}}. The {{^...$}} anchors require the regex to consume the entire input,
> but {{.*?}} cannot cross newlines, so any
> multi-line input containing the literal substring is silently rejected.
> h2. Workaround
> Escape the underscore so the pattern no longer reaches {{ComplexChecker}}
> (it routes through one of the byte-substring checkers instead):
> {code:sql}
> SELECT q LIKE '%information\_schema%' FROM t;
> {code}
> Or disable vectorized execution for the query:
> {code:sql}
> SET hive.vectorized.execution.enabled=false;
> {code}
> h2. Reproduction
> {code:sql}
> SET hive.fetch.task.conversion=none;
> /* Materialization to ORC is required so the LIKE filter scans a vectorized
> input. UNION ALL of literals plans as a UDTF and disables vectorization
> on the map vertex; the materialized intermediate restores it. ORC is
> required because the default TextFile format treats \n as a row
> terminator and truncates multi-line strings. */
> SET hive.default.fileformat=ORC;
> SET hive.optimize.cte.materialize.threshold=1;
> SET hive.optimize.cte.materialize.full.aggregate.only=false;
> WITH t AS (
> SELECT 1 AS rid, 'simple match information_schema only one
> line' AS q
> UNION ALL SELECT 2 AS rid, concat('line one', chr(10), 'line two contains
> information_schema here', chr(10)) AS q
> UNION ALL SELECT 3 AS rid, concat('prefix', chr(10), 'more', chr(10),
> 'stuff', chr(10), 'information_schema appears late') AS q
> UNION ALL SELECT 4 AS rid, concat('information_schema is on the first
> line', chr(10), 'but there are more lines', chr(10), 'after it') AS q
> )
> SELECT rid,
> CASE WHEN q LIKE '%information_schema%' THEN 'match' ELSE 'NO_MATCH'
> END AS result
> FROM t
> ORDER BY rid;
> {code}
> *Expected:* all four rows return {{match}}. Each row's {{q}} contains the
> literal substring {{information_schema}}, and {{%information_schema%}}
> matches that substring under SQL semantics (the
> literal {{_}} satisfies the {{_}} single-char wildcard).
> *Actual:*
> {noformat}
> 1 match
> 2 NO_MATCH
> 3 NO_MATCH
> 4 NO_MATCH
> {noformat}
> Only row 1 (single-line) is matched. Rows 2, 3, and 4 each contain {{\n}}
> outside the {{information_schema}} run and are dropped. Running the same
> query with {{SET
> hive.vectorized.execution.enabled=false}} returns all four rows.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)