Robert Joseph Evans created SPARK-44500:
-------------------------------------------
Summary: parse_url treats key as regular expression
Key: SPARK-44500
URL: https://issues.apache.org/jira/browse/SPARK-44500
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.1, 3.4.0, 3.3.0, 3.2.0
Reporter: Robert Joseph Evans
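To be clear, I am not 100% sure that this is a bug. It might be a feature, but I
don't see it used as a feature anywhere. If it is a feature, it really should be
documented, because there are pitfalls. If it is a bug, it should be fixed,
because it is confusing and makes it easy to shoot yourself in the foot. In the
example below, the key 'a.c' also matches the parameter 'abc', because '.' is
treated as a regex wildcard, and the key 'a[c' fails with an exception because
it is not a valid regular expression.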
```scala
> val urls = Seq("http://foo/bar?abc=BAD&a.c=GOOD",
> "http://foo/bar?a.c=GOOD&abc=BAD").toDF
> urls.selectExpr("parse_url(value, 'QUERY', 'a.c')").show(false)
+----------------------------+
|parse_url(value, QUERY, a.c)|
+----------------------------+
|BAD                         |
|GOOD                        |
+----------------------------+
> urls.selectExpr("parse_url(value, 'QUERY', 'a[c')").show(false)
java.util.regex.PatternSyntaxException: Unclosed character class near index 15
(&|^)a[c=([^&]*)
               ^
  at java.util.regex.Pattern.error(Pattern.java:1969)
  at java.util.regex.Pattern.clazz(Pattern.java:2562)
  at java.util.regex.Pattern.sequence(Pattern.java:2077)
  at java.util.regex.Pattern.expr(Pattern.java:2010)
  at java.util.regex.Pattern.compile(Pattern.java:1702)
  at java.util.regex.Pattern.<init>(Pattern.java:1352)
  at java.util.regex.Pattern.compile(Pattern.java:1028)
```
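The simple fix is to quote the key with Pattern.quote when building the pattern,
so that it is matched literally instead of being interpreted as a regular
expression.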
```scala
private def getPattern(key: UTF8String): Pattern = {
  Pattern.compile(REGEXPREFIX + Pattern.quote(key.toString) + REGEXSUBFIX)
}
```
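As a rough standalone illustration of the difference, here is a minimal sketch that does not use Spark at all. The object and method names are made up for this example, and using group 2 as the value is an assumption based on the pattern layout. Only the prefix and suffix strings are taken from the pattern printed in the stack trace above.

```scala
import java.util.regex.Pattern

// Hypothetical sketch, not the actual Spark implementation.
object ParseUrlKeySketch {
  // Prefix and suffix as they appear in the pattern from the stack trace.
  private val RegexPrefix = "(&|^)"
  private val RegexSuffix = "=([^&]*)"

  // Reported behavior: the key is spliced in as regex source.
  def unquotedPattern(key: String): Pattern =
    Pattern.compile(RegexPrefix + key + RegexSuffix)

  // Proposed behavior: the key is matched literally.
  def quotedPattern(key: String): Pattern =
    Pattern.compile(RegexPrefix + Pattern.quote(key) + RegexSuffix)

  // Returns the first value found for the key in a raw query string, if any.
  // Group 1 is the "(&|^)" prefix, so the value is group 2 (assumed layout).
  def extract(pattern: Pattern, query: String): Option[String] = {
    val m = pattern.matcher(query)
    if (m.find()) Some(m.group(2)) else None
  }

  def main(args: Array[String]): Unit = {
    val query = "abc=BAD&a.c=GOOD"
    println(extract(unquotedPattern("a.c"), query)) // Some(BAD)
    println(extract(quotedPattern("a.c"), query))   // Some(GOOD)
    println(extract(quotedPattern("a[c"), "a[c=1")) // Some(1)
  }
}
```

With quoting, 'a.c' no longer matches 'abc', and a key such as 'a[c' compiles and is looked up literally instead of throwing PatternSyntaxException.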