[ https://issues.apache.org/jira/browse/SPARK-44500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746705#comment-17746705 ]
Pablo Langa Blanco commented on SPARK-44500:
--------------------------------------------
[[email protected]] What do you think?
> parse_url treats key as regular expression
> ------------------------------------------
>
> Key: SPARK-44500
> URL: https://issues.apache.org/jira/browse/SPARK-44500
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.4.1
> Reporter: Robert Joseph Evans
> Priority: Major
>
> To be clear, I am not 100% sure that this is a bug. It might be a feature, but
> I don't see anywhere that it is used as one. If it is a feature, it
> really should be documented, because there are pitfalls. If it is a bug, it
> should be fixed, because it is really confusing and it is easy to shoot
> yourself in the foot.
> ```scala
> scala> val urls = Seq("http://foo/bar?abc=BAD&a.c=GOOD",
>   "http://foo/bar?a.c=GOOD&abc=BAD").toDF
> scala> urls.selectExpr("parse_url(value, 'QUERY', 'a.c')").show(false)
> +----------------------------+
> |parse_url(value, QUERY, a.c)|
> +----------------------------+
> |BAD                         |
> |GOOD                        |
> +----------------------------+
> scala> urls.selectExpr("parse_url(value, 'QUERY', 'a[c')").show(false)
> java.util.regex.PatternSyntaxException: Unclosed character class near index 15
> (&|^)a[c=([^&]*)
>                ^
> at java.util.regex.Pattern.error(Pattern.java:1969)
> at java.util.regex.Pattern.clazz(Pattern.java:2562)
> at java.util.regex.Pattern.sequence(Pattern.java:2077)
> at java.util.regex.Pattern.expr(Pattern.java:2010)
> at java.util.regex.Pattern.compile(Pattern.java:1702)
> at java.util.regex.Pattern.<init>(Pattern.java:1352)
> at java.util.regex.Pattern.compile(Pattern.java:1028)
> ```
> The simple fix is to quote the key when making the pattern.
> ```scala
> private def getPattern(key: UTF8String): Pattern = {
>   Pattern.compile(REGEXPREFIX + Pattern.quote(key.toString) + REGEXSUBFIX)
> }
> ```
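
For anyone who wants to see the behaviour without a Spark session, here is a minimal standalone sketch using only `java.util.regex.Pattern`. The prefix and suffix strings are assumptions inferred from the pattern printed in the exception above, and `ParseUrlKeyDemo`, `unquotedPattern`, `quotedPattern` and `extract` are hypothetical names for illustration, not Spark code.

```scala
import java.util.regex.Pattern

object ParseUrlKeyDemo {
  // Assumed values, inferred from the pattern shown in the exception message:
  // (&|^)a[c=([^&]*)
  val RegexPrefix = "(&|^)"
  val RegexSuffix = "=([^&]*)"

  // Interpolates the key as-is, so regex metacharacters in the key stay active.
  def unquotedPattern(key: String): Pattern =
    Pattern.compile(RegexPrefix + key + RegexSuffix)

  // Wraps the key with Pattern.quote so it is matched literally.
  def quotedPattern(key: String): Pattern =
    Pattern.compile(RegexPrefix + Pattern.quote(key) + RegexSuffix)

  // Returns the value captured by the first match in the query string, if any.
  def extract(pattern: Pattern, query: String): Option[String] = {
    val m = pattern.matcher(query)
    if (m.find()) Some(m.group(2)) else None
  }

  def main(args: Array[String]): Unit = {
    val query = "abc=BAD&a.c=GOOD"
    // '.' matches any character, so the key "a.c" also matches "abc".
    println(extract(unquotedPattern("a.c"), query))  // Some(BAD)
    // With quoting, only the literal key "a.c" matches.
    println(extract(quotedPattern("a.c"), query))    // Some(GOOD)
    // A key containing '[' no longer fails to compile.
    println(extract(quotedPattern("a[c"), "a[c=OK")) // Some(OK)
  }
}
```

Under these assumptions, the unquoted variant returns whichever parameter the regex reaches first, which would explain why the two URLs in the report give different answers for the same key.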