[
https://issues.apache.org/jira/browse/SPARK-22183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402042#comment-16402042
]
Michal Szafranski edited comment on SPARK-22183 at 3/16/18 3:20 PM:
--------------------------------------------------------------------
As for the reporter's use case, an explicit {{contains()}} call would not just
work around this issue; I would also expect it to be significantly faster. I
don't think it is exposed in SQL, though:
{code:java}
sqlContext.sql("SELECT * FROM test t").filter($"_1".contains($"_2")).show()
{code}
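If a pure-SQL form is needed, {{instr()}} might serve as a rough equivalent
(untested sketch; it returns the 1-based position of the substring, or 0 when
absent):
{code:java}
// hypothetical pure-SQL equivalent of contains(), via instr()
sqlContext.sql("SELECT * FROM test t WHERE instr(t._1, t._2) > 0").show()
{code}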
> Inconsistency in LIKE escaping between literal values and column-based ones
> ---------------------------------------------------------------------------
>
> Key: SPARK-22183
> URL: https://issues.apache.org/jira/browse/SPARK-22183
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Adrien Lavoillotte
> Priority: Minor
>
> I'm trying to implement auto-escaping for {{LIKE}} expressions, in order to
> have filters & join conditions like:
> * Column A's value contains column B's
> * Column A's value contains some literal string
> So I need to escape {{LIKE}}-significant characters {{%}} and {{_}}. Since
> SparkSQL does not support {{LIKE expr ESCAPE char}}, I need to escape using
> \, and presumably also \ itself (twice in the case of literals, since '\\'
> represents a single \).
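> As an aside, a minimal sketch of the kind of escaping helper I mean (name
> hypothetical, assuming \ as the escape character):
> {code:java}
> // hypothetical helper: prefix LIKE-significant characters (%, _, \) with \
> def escapeLike(s: String): String =
>   s.replaceAll("([%_\\\\])", "\\\\$1")
> {code}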
> But it seems that a literal in a {{LIKE}} expression does not get quite the
> same escaping as other string literals or as non-literal {{LIKE}} expressions,
> seemingly depending on whether the left-hand side and/or right-hand side are
> literals or columns.
> Note: I'm using triple quotes below to avoid Scala-level \ escaping. And in
> the body of this description, I'm purposely using zero-width spaces to keep
> Jira from transforming my \.
> On Spark 2.2.0:
> {code}
> // both LHS & RHS literals
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+---------+
> | \|\ LIKE \\|
> +---+---------+
> | \| true|
> +---+---------+
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> org.apache.spark.sql.AnalysisException: the pattern '\' is invalid, it is not allowed to end with the escape character;
> at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
> at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:53)
> at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
> at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.compile(regexpExpressions.scala:50)
> at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.pattern(regexpExpressions.scala:53)
> at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.nullSafeEval(regexpExpressions.scala:56)
> at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
> ...
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+-------------+
> |a\b|a\b LIKE a\\b|
> +---+-------------+
> |a\b| true|
> +---+-------------+
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> org.apache.spark.sql.AnalysisException: the pattern 'a\b' is invalid, the escape character is not allowed to precede 'b';
> at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
> at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
> at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
> ...
> // test data
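> // (hypothetical reconstruction of how this table could be registered;
> // note one row holds a literal backslash-n and one a real newline)
> // Seq(("ok","ok"), ("Ok","ok"), ("a_b","a_b"), ("aab","a_b"), ("c%d","c%d"),
> //     ("caad","c%d"), ("e\\nf","e\\nf"), ("e\nf","e\\nf"))
> //   .toDF("_1", "_2").createOrReplaceTempView("test")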
> spark.sql("""SELECT * FROM test""").show()
> +----+----+
> | _1| _2|
> +----+----+
> | ok| ok|
> | Ok| ok|
> | a_b| a_b|
> | aab| a_b|
> | c%d| c%d|
> |caad| c%d|
> |e\nf|e\nf|
> | e
> f|e\nf|
> +----+----+
> // both column-based
> // not escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.sql.AnalysisException: the pattern 'e\nf' is invalid, the escape character is not allowed to precede 'n';
> at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
> at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
> at org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
> ...
> // escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> | _1| _2|
> +----+----+
> | ok| ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> | _1| _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 2.2:
> * Escaping \ in the RHS is mandatory if the RHS is not a literal; otherwise we
> get an error. So far so good.
> * If the LHS is also a literal, the same applies.
> * If the RHS is a literal and the LHS is a column, the string-literal escaping
> of \ seems to interfere with the {{LIKE}} escaping, since re-escaping \ no
> longer matches. I would expect to need \\\\ between quotes to match a single \
> (one escaping for the string literal, one for the {{LIKE}} escaping).
> On Spark 1.6 (and, it seems, until Spark 2.2.0):
> {code}
> // both LHS & RHS literals
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+-----+
> |_c0| _c1|
> +---+-----+
> | \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> +---+-----+
> |_c0| _c1|
> +---+-----+
> | \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> // Same test data as for 2.2
> // both column-based
> // not escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> +----+----+
> | _1| _2|
> +----+----+
> | ok| ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> | _1| _2|
> +----+----+
> | ok| ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> | _1| _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 1.6:
> * A single-character literal string of \ cannot be matched.
> * Matching a string (literal or column) containing \ works whether you escape
> the \ or not, which makes me think there is room for ambiguity (would \\n
> match \n? see the probe below).
> * If the RHS is a literal and the LHS is a column, the same issue as in Spark 2.2.
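> A quick probe of that ambiguity could look like this (hypothetical, assuming
> the SQL parser turns \n into a newline inside string literals):
> {code}
> // does the escaped pattern 'a\\nb' match a literal backslash-n, a newline, or both?
> sqlContext.sql("""SELECT 'a\\nb' LIKE 'a\\\\nb', 'a\nb' LIKE 'a\\\\nb'""").show()
> {code}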