[ 
https://issues.apache.org/jira/browse/HIVE-29618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29618:
------------------------------------------
    Description: 
 h2. Description

  The vectorized {{LIKE}} operator's {{ComplexChecker}} compiles its regex via 
{{Pattern.compile(pattern)}} without the {{Pattern.DOTALL}} flag. Patterns 
containing an unescaped {{_}} wildcard route
  through this path. By Java {{Pattern}} default, {{.}} does not match {{\n}}, 
so the anchored regex {{^.*?<literal>.<literal>.*?$}} that the {{COMPLEX}} 
branch produces cannot consume newlines, and
  multi-line inputs containing the literal substring are silently rejected. The 
bug is silent.

  The non-vectorized counterpart {{UDFLike.evaluate}} 
({{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}}) compiles the 
same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
   Pattern.DOTALL)}} and returns the correct answer. The asymmetry was 
introduced by HIVE-22008 ("LIKE Operator should match multi-line input", 
2019-08-01), which added {{Pattern.DOTALL}} to
  {{UDFLike}} only and did not update the parallel implementation in 
{{AbstractFilterStringColLikeStringScalar.ComplexChecker}}, created earlier by 
HIVE-4642 (2013-10-03). The regression test added by
  HIVE-22008 ({{ql/src/test/queries/clientpositive/udf_like.q}}) does not 
enable vectorized execution, so the vectorized {{ComplexChecker}} path was 
never exercised by it.

  Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{pom.xml}} declares 
version {{4.3.0-SNAPSHOT}}).

  h2. When does the bug fire?

  All of the following co-occurring conditions are required:

  * Vectorized execution is on (default).
  * The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard). 
This is the only path that reaches {{ComplexChecker}}; patterns without {{_}} 
route to non-regex {{NoneChecker}} /
  {{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / {{ChainedChecker}}, 
which are byte-substring searches and handle newlines correctly.
  * The input string contains {{\n}} (or another line terminator that {{.}} 
does not match by default) outside the literal portion of the pattern; 
equivalently, the {{.*?}} segments of the produced
  regex would need to consume a newline to satisfy {{^...$}}.

  Removing any one condition masks the bug.

  h2. Mechanism

  The classifier 
{{AbstractFilterStringColLikeStringScalar.UDFLikePattern.matcher}} returns 
{{COMPLEX}} as soon as it sees an unescaped {{_}}. The {{COMPLEX}} branch 
builds its regex as:

  {code:java}
  "^" + UDFLike.likePatternToRegExp(pattern) + "$"
  {code}

  {{UDFLike.likePatternToRegExp}} translates {{_}} to {{.}}, {{%}} to {{.*?}}, 
and other characters to {{Pattern.quote}}-wrapped literals. For pattern 
{{%information_schema%}} the produced regex
  (literals elided) is {{^.*?information.schema.*?$}}. {{ComplexChecker}} then 
compiles it at

  {code:java}
  compiledPattern = Pattern.compile(pattern);
  {code}

  The non-vectorized counterpart at 
{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:

  {code:java}
  p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
  {code}

  Without {{Pattern.DOTALL}}, {{.}} (and therefore {{.*?}}) does not match 
{{\n}}. The {{^...$}} anchors require the regex to consume the entire input, 
but {{.*?}} cannot cross newlines, so any
  multi-line input containing the literal substring is silently rejected.

  h2. Workaround

  Escape the underscore so the pattern no longer reaches {{ComplexChecker}} (it 
routes through one of the byte-substring checkers instead):

  {code:sql}
  SELECT q LIKE '%information\_schema%' FROM t;
  {code}

  Or disable vectorized execution for the query:

  {code:sql}
  SET hive.vectorized.execution.enabled=false;
  {code}

  h2. Reproduction

  {code:sql}
  SET hive.fetch.task.conversion=none;
  /* Materialization to ORC is required so the LIKE filter scans a vectorized
     input. UNION ALL of literals plans as a UDTF and disables vectorization
     on the map vertex; the materialized intermediate restores it. ORC is
     required because the default TextFile format treats \n as a row
     terminator and truncates multi-line strings. */
  SET hive.default.fileformat=ORC;
  SET hive.optimize.cte.materialize.threshold=1;
  SET hive.optimize.cte.materialize.full.aggregate.only=false;

  WITH t AS (
              SELECT 1 AS rid, 'simple match information_schema only one line' 
AS q
    UNION ALL SELECT 2 AS rid, concat('line one', chr(10), 'line two contains 
information_schema here', chr(10)) AS q
    UNION ALL SELECT 3 AS rid, concat('prefix', chr(10), 'more', chr(10), 
'stuff', chr(10), 'information_schema appears late') AS q
    UNION ALL SELECT 4 AS rid, concat('information_schema is on the first 
line', chr(10), 'but there are more lines', chr(10), 'after it') AS q
  )
  SELECT rid,
         CASE WHEN q LIKE '%information_schema%' THEN 'match' ELSE 'NO_MATCH' 
END AS result
  FROM t
  ORDER BY rid;
  {code}

  *Expected:* all four rows return {{match}}. Each row's {{q}} contains the 
literal substring {{information_schema}}, and {{%information_schema%}} matches 
that substring under SQL semantics (the
  literal {{_}} satisfies the {{_}} single-char wildcard).

  *Actual:*

  {noformat}
  1    match
  2    NO_MATCH
  3    NO_MATCH
  4    NO_MATCH
  {noformat}

  Only row 1 (single-line) is matched. Rows 2, 3, and 4 each contain {{\n}} 
outside the {{information_schema}} run and are dropped. Running the same query 
with {{SET
  hive.vectorized.execution.enabled=false}} returns all four rows.

  was:
 h2. Description

  The vectorized \{{LIKE}} operator's \{{ComplexChecker}} compiles its regex 
via \{{Pattern.compile(pattern)}} without the \{{Pattern.DOTALL}} flag. 
Patterns containing an unescaped \{{_}} wildcard route
  through this path. By Java \{{Pattern}} default, \{{.}} does not match 
\{{\n}}, so the anchored regex \{{^.*?<literal>.<literal>.*?$}} that the 
\{{COMPLEX}} branch produces cannot consume newlines, and
  multi-line inputs containing the literal substring are silently rejected. The 
bug is silent.

  The non-vectorized counterpart \{{UDFLike.evaluate}} 
(\{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}}) compiles the 
same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
   Pattern.DOTALL)}} and returns the correct answer. The asymmetry was 
introduced by HIVE-22008 ("LIKE Operator should match multi-line input", 
2019-08-01), which added \{{Pattern.DOTALL}} to
  \{{UDFLike}} only and did not update the parallel implementation in 
\{{AbstractFilterStringColLikeStringScalar.ComplexChecker}}, created earlier by 
HIVE-4642 (2013-10-03). The regression test added by
  HIVE-22008 (\{{ql/src/test/queries/clientpositive/udf_like.q}}) does not 
enable vectorized execution, so the vectorized \{{ComplexChecker}} path was 
never exercised by it.

  Reproduced on Apache Hive master, commit \{{1516fb91e8}} (\{{pom.xml}} 
declares version \{{4.3.0-SNAPSHOT}}).

  h2. When does the bug fire?

  All of the following co-occurring conditions are required:

  * Vectorized execution is on (default).
  * The \{{LIKE}} pattern contains an unescaped \{{_}} (single-char wildcard). 
This is the only path that reaches \{{ComplexChecker}}; patterns without \{{_}} 
route to non-regex \{{NoneChecker}} /
  \{{BeginChecker}} / \{{EndChecker}} / \{{MiddleChecker}} / 
\{{ChainedChecker}}, which are byte-substring searches and handle newlines 
correctly.
  * The input string contains \{{\n}} (or another line terminator that \{{.}} 
does not match by default) outside the literal portion of the pattern; 
equivalently, the \{{.*?}} segments of the produced
  regex would need to consume a newline to satisfy \{{^...$}}.

  Removing any one condition masks the bug.

  h2. Mechanism

  The classifier 
\{{AbstractFilterStringColLikeStringScalar.UDFLikePattern.matcher}} returns 
\{{COMPLEX}} as soon as it sees an unescaped \{{_}}. The \{{COMPLEX}} branch 
builds its regex as:

  \{code:java}
  "^" + UDFLike.likePatternToRegExp(pattern) + "$"
  \{code}

  \{{UDFLike.likePatternToRegExp}} translates \{{_}} to \{{.}}, \{{%}} to 
\{{.*?}}, and other characters to \{{Pattern.quote}}-wrapped literals. For 
pattern \{{%information_schema%}} the produced regex
  (literals elided) is \{{^.*?information.schema.*?$}}. \{{ComplexChecker}} 
then compiles it at

  \{code:java}
  compiledPattern = Pattern.compile(pattern);
  \{code}

  The non-vectorized counterpart at 
\{{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:

  \{code:java}
  p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
  \{code}

  Without \{{Pattern.DOTALL}}, \{{.}} (and therefore \{{.*?}}) does not match 
\{{\n}}. The \{{^...$}} anchors require the regex to consume the entire input, 
but \{{.*?}} cannot cross newlines, so any
  multi-line input containing the literal substring is silently rejected.

  h2. Workaround

  Escape the underscore so the pattern no longer reaches \{{ComplexChecker}} 
(it routes through one of the byte-substring checkers instead):

  \{code:sql}
  SELECT q LIKE '%information\_schema%' FROM t;
  \{code}

  Or disable vectorized execution for the query:

  \{code:sql}
  SET hive.vectorized.execution.enabled=false;
  \{code}

  h2. Reproduction

  \{code:sql}
  SET hive.fetch.task.conversion=none;
  /* Materialization to ORC is required so the LIKE filter scans a vectorized
     input. UNION ALL of literals plans as a UDTF and disables vectorization
     on the map vertex; the materialized intermediate restores it. ORC is
     required because the default TextFile format treats \n as a row
     terminator and truncates multi-line strings. */
  SET hive.default.fileformat=ORC;
  SET hive.optimize.cte.materialize.threshold=1;
  SET hive.optimize.cte.materialize.full.aggregate.only=false;

  WITH t AS (
              SELECT 1 AS rid, 'simple match information_schema only one line' 
AS q
    UNION ALL SELECT 2 AS rid, concat('line one', chr(10), 'line two contains 
information_schema here', chr(10)) AS q
    UNION ALL SELECT 3 AS rid, concat('prefix', chr(10), 'more', chr(10), 
'stuff', chr(10), 'information_schema appears late') AS q
    UNION ALL SELECT 4 AS rid, concat('information_schema is on the first 
line', chr(10), 'but there are more lines', chr(10), 'after it') AS q
  )
  SELECT rid,
         CASE WHEN q LIKE '%information_schema%' THEN 'match' ELSE 'NO_MATCH' 
END AS result
  FROM t
  ORDER BY rid;
  \{code}

  *Expected:* all four rows return \{{match}}. Each row's \{{q}} contains the 
literal substring \{{information_schema}}, and \{{%information_schema%}} 
matches that substring under SQL semantics (the
  literal \{{_}} satisfies the \{{_}} single-char wildcard).

  *Actual:*

  \{noformat}
  1    match
  2    NO_MATCH
  3    NO_MATCH
  4    NO_MATCH
  \{noformat}

  Only row 1 (single-line) is matched. Rows 2, 3, and 4 each contain \{{\n}} 
outside the \{{information_schema}} run and are dropped. Running the same query 
with {{SET
  hive.vectorized.execution.enabled=false}} returns all four rows.


> Vectorized LIKE Operator should match multi-line input (HIVE-22008 followup)
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-29618
>                 URL: https://issues.apache.org/jira/browse/HIVE-29618
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Konstantin Bereznyakov
>            Priority: Major
>
>  h2. Description
>   The vectorized {{LIKE}} operator's {{ComplexChecker}} compiles its regex 
> via {{Pattern.compile(pattern)}} without the {{Pattern.DOTALL}} flag. 
> Patterns containing an unescaped {{_}} wildcard route
>   through this path. By Java {{Pattern}} default, {{.}} does not match 
> {{\n}}, so the anchored regex {{^.*?<literal>.<literal>.*?$}} that the 
> {{COMPLEX}} branch produces cannot consume newlines, and
>   multi-line inputs containing the literal substring are silently rejected. 
> The bug is silent.
>   The non-vectorized counterpart {{UDFLike.evaluate}} 
> ({{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}}) compiles the 
> same regex via {{Pattern.compile(likePatternToRegExp(strLikePattern),
>    Pattern.DOTALL)}} and returns the correct answer. The asymmetry was 
> introduced by HIVE-22008 ("LIKE Operator should match multi-line input", 
> 2019-08-01), which added {{Pattern.DOTALL}} to
>   {{UDFLike}} only and did not update the parallel implementation in 
> {{AbstractFilterStringColLikeStringScalar.ComplexChecker}}, created earlier 
> by HIVE-4642 (2013-10-03). The regression test added by
>   HIVE-22008 ({{ql/src/test/queries/clientpositive/udf_like.q}}) does not 
> enable vectorized execution, so the vectorized {{ComplexChecker}} path was 
> never exercised by it.
>   Reproduced on Apache Hive master, commit {{1516fb91e8}} ({{pom.xml}} 
> declares version {{4.3.0-SNAPSHOT}}).
>   h2. When does the bug fire?
>   All of the following co-occurring conditions are required:
>   * Vectorized execution is on (default).
>   * The {{LIKE}} pattern contains an unescaped {{_}} (single-char wildcard). 
> This is the only path that reaches {{ComplexChecker}}; patterns without {{_}} 
> route to non-regex {{NoneChecker}} /
>   {{BeginChecker}} / {{EndChecker}} / {{MiddleChecker}} / {{ChainedChecker}}, 
> which are byte-substring searches and handle newlines correctly.
>   * The input string contains {{\n}} (or another line terminator that {{.}} 
> does not match by default) outside the literal portion of the pattern; 
> equivalently, the {{.*?}} segments of the produced
>   regex would need to consume a newline to satisfy {{^...$}}.
>   Removing any one condition masks the bug.
>   h2. Mechanism
>   The classifier 
> {{AbstractFilterStringColLikeStringScalar.UDFLikePattern.matcher}} returns 
> {{COMPLEX}} as soon as it sees an unescaped {{_}}. The {{COMPLEX}} branch 
> builds its regex as:
>   {code:java}
>   "^" + UDFLike.likePatternToRegExp(pattern) + "$"
>   {code}
>   {{UDFLike.likePatternToRegExp}} translates {{_}} to {{.}}, {{%}} to 
> {{.*?}}, and other characters to {{Pattern.quote}}-wrapped literals. For 
> pattern {{%information_schema%}} the produced regex
>   (literals elided) is {{^.*?information.schema.*?$}}. {{ComplexChecker}} 
> then compiles it at
>   {code:java}
>   compiledPattern = Pattern.compile(pattern);
>   {code}
>   The non-vectorized counterpart at 
> {{ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java:194}} is:
>   {code:java}
>   p = Pattern.compile(likePatternToRegExp(strLikePattern), Pattern.DOTALL);
>   {code}
>   Without {{Pattern.DOTALL}}, {{.}} (and therefore {{.*?}}) does not match 
> {{\n}}. The {{^...$}} anchors require the regex to consume the entire input, 
> but {{.*?}} cannot cross newlines, so any
>   multi-line input containing the literal substring is silently rejected.
>   h2. Workaround
>   Escape the underscore so the pattern no longer reaches {{ComplexChecker}} 
> (it routes through one of the byte-substring checkers instead):
>   {code:sql}
>   SELECT q LIKE '%information\_schema%' FROM t;
>   {code}
>   Or disable vectorized execution for the query:
>   {code:sql}
>   SET hive.vectorized.execution.enabled=false;
>   {code}
>   h2. Reproduction
>   {code:sql}
>   SET hive.fetch.task.conversion=none;
>   /* Materialization to ORC is required so the LIKE filter scans a vectorized
>      input. UNION ALL of literals plans as a UDTF and disables vectorization
>      on the map vertex; the materialized intermediate restores it. ORC is
>      required because the default TextFile format treats \n as a row
>      terminator and truncates multi-line strings. */
>   SET hive.default.fileformat=ORC;
>   SET hive.optimize.cte.materialize.threshold=1;
>   SET hive.optimize.cte.materialize.full.aggregate.only=false;
>   WITH t AS (
>               SELECT 1 AS rid, 'simple match information_schema only one 
> line' AS q
>     UNION ALL SELECT 2 AS rid, concat('line one', chr(10), 'line two contains 
> information_schema here', chr(10)) AS q
>     UNION ALL SELECT 3 AS rid, concat('prefix', chr(10), 'more', chr(10), 
> 'stuff', chr(10), 'information_schema appears late') AS q
>     UNION ALL SELECT 4 AS rid, concat('information_schema is on the first 
> line', chr(10), 'but there are more lines', chr(10), 'after it') AS q
>   )
>   SELECT rid,
>          CASE WHEN q LIKE '%information_schema%' THEN 'match' ELSE 'NO_MATCH' 
> END AS result
>   FROM t
>   ORDER BY rid;
>   {code}
>   *Expected:* all four rows return {{match}}. Each row's {{q}} contains the 
> literal substring {{information_schema}}, and {{%information_schema%}} 
> matches that substring under SQL semantics (the
>   literal {{_}} satisfies the {{_}} single-char wildcard).
>   *Actual:*
>   {noformat}
>   1    match
>   2    NO_MATCH
>   3    NO_MATCH
>   4    NO_MATCH
>   {noformat}
>   Only row 1 (single-line) is matched. Rows 2, 3, and 4 each contain {{\n}} 
> outside the {{information_schema}} run and are dropped. Running the same 
> query with {{SET
>   hive.vectorized.execution.enabled=false}} returns all four rows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to