[
https://issues.apache.org/jira/browse/SPARK-22183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401738#comment-16401738
]
Michal Szafranski edited comment on SPARK-22183 at 3/16/18 11:02 AM:
---------------------------------------------------------------------
This seems to be an issue with the 'LikeSimplification' rule. When it replaces Like
with a simple function (equals, startsWith, etc.), it does not remove the escapes
from the pattern-derived parameter. So
{code:java}
'\\' LIKE '\\\\'
{code}
is replaced with
{code:java}
'\\' = '\\\\'
{code}
which is incorrect.
The reported inconsistency comes down to whether or not this rule is applied. When
both LHS and RHS are constants, the LIKE is constant-folded before the
LikeSimplification rule runs, and since the pattern (RHS) needs to be constant for
the rule to apply, that leaves the `column LIKE constant` case as the one where the
problem triggers.
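A minimal sketch (in Scala; not Spark's actual code, the helper name is hypothetical) of the unescaping that would be needed on the pattern before reusing it as a plain comparison value:
{code:java}
// Hypothetical helper: given a LIKE pattern that contains no unescaped
// wildcards, drop the backslash escapes: \\ -> \ , \% -> % , \_ -> _
def unescapeLikePattern(pattern: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < pattern.length) {
    if (pattern.charAt(i) == '\\' && i + 1 < pattern.length) {
      sb.append(pattern.charAt(i + 1)) // keep only the escaped character
      i += 2
    } else {
      sb.append(pattern.charAt(i))
      i += 1
    }
  }
  sb.toString
}
{code}
With the escapes dropped, the example above would compare \ with \ (a match) instead of \ with \\.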
> Inconsistency in LIKE escaping between literal values and column-based ones
> ---------------------------------------------------------------------------
>
> Key: SPARK-22183
> URL: https://issues.apache.org/jira/browse/SPARK-22183
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Adrien Lavoillotte
> Priority: Minor
>
> I'm trying to implement auto-escaping for {{LIKE}} expressions, in order to
> have filters & join conditions like:
> * Column A's value contains column B's
> * Column A's value contains some literal string
> So I need to escape the {{LIKE}}-significant characters {{%}} and {{_}}. Since
> SparkSQL does not support {{LIKE expr ESCAPE char}}, I need to escape using \,
> and presumably also escape \ itself (twice in the case of literals, since '\\'
> represents a single \).
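> As a reference point, here is a minimal sketch of the auto-escaping helper I have
> in mind for literal values (the helper is purely illustrative, not part of Spark):
> {code}
> // Illustrative helper: escape a plain string so it can be used as an
> // exact-match LIKE pattern (backslash first, then the wildcards % and _).
> def escapeLikeLiteral(s: String): String =
>   s.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
> {code}
> The REGEXP_REPLACE calls further below attempt the same escaping on the SQL side
> for column values.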
> But it seems that a literal in a {{LIKE}} expression does not get quite the
> same escaping as other string literals or as non-literal {{LIKE}} expressions,
> seemingly depending on whether the left-hand side and/or right-hand side are
> literals or columns.
> Note: I'm using triple-quotes below to avoid Scala-level \ escaping. And in
> the body of this description, I'm purposely using zero-width spaces to avoid
> Jira transforming my \.
> On Spark 2.2.0:
> {code}
> // both LHS & RHS literals
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+---------+
> | \|\ LIKE \\|
> +---+---------+
> | \| true|
> +---+---------+
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> org.apache.spark.sql.AnalysisException: the pattern '\' is invalid, it is not allowed to end with the escape character;
> at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
> at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:53)
> at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
> at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.compile(regexpExpressions.scala:50)
> at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.pattern(regexpExpressions.scala:53)
> at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.nullSafeEval(regexpExpressions.scala:56)
> at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
> ...
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+-------------+
> |a\b|a\b LIKE a\\b|
> +---+-------------+
> |a\b| true|
> +---+-------------+
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> org.apache.spark.sql.AnalysisException: the pattern 'a\b' is invalid, the escape character is not allowed to precede 'b';
> at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
> at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
> at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
> ...
> // test data
> spark.sql("""SELECT * FROM test""").show()
> +----+----+
> | _1| _2|
> +----+----+
> | ok| ok|
> | Ok| ok|
> | a_b| a_b|
> | aab| a_b|
> | c%d| c%d|
> |caad| c%d|
> |e\nf|e\nf|
> | e
> f|e\nf|
> +----+----+
> // both column-based
> // not escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.sql.AnalysisException: the pattern 'e\nf' is invalid, the escape character is not allowed to precede 'n';
> at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
> at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
> at org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
> ...
> // escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> | _1| _2|
> +----+----+
> | ok| ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> | _1| _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 2.2:
> * Escaping \ in the RHS is mandatory if the RHS is not a literal; otherwise we
> get an error. So far so good.
> * If both LHS and RHS are literals, the same applies.
> * If the RHS is a literal and the LHS is a column, the string-literal escaping
> of \ seems to interfere with the {{LIKE}} escaping, since re-escaping \ no
> longer matches. I would expect to need \\\\ between quotes to match a single \
> (one level of escaping for the string literal, one for the {{LIKE}} escaping);
> see the illustration just below.
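> To make that expectation concrete (using the test table above and the 2.2.0
> results already shown):
> {code}
> -- expected: one level of escaping for the string literal, one for LIKE
> SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'  -- should match e\nf, but returns no rows
> -- observed: the single string-literal escape is what actually matches
> SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'    -- matches e\nf
> {code}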
> On Spark 1.6 (and it seems until Spark 2.2.0):
> {code}
> // both LHS & RHS literals
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+-----+
> |_c0| _c1|
> +---+-----+
> | \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> +---+-----+
> |_c0| _c1|
> +---+-----+
> | \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> // Same test data as for 2.2
> // both column-based
> // not escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> +----+----+
> | _1| _2|
> +----+----+
> | ok| ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> | _1| _2|
> +----+----+
> | ok| ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> | _1| _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 1.6:
> * Cannot match a literal string consisting of a single \ (neither the escaped
> nor the unescaped pattern returns true).
> * Matching a string (literal or column) containing \ works whether you escape
> the \ or not, which makes me think there is room for ambiguity (would \\n match
> \n?).
> * If the RHS is a literal and the LHS is a column, same issue as in Spark 2.2.