Yash Datta created SPARK-23056:
----------------------------------

             Summary: parse_url regression when switched to using java.net.URI 
instead of java.net.URL
                 Key: SPARK-23056
                 URL: https://issues.apache.org/jira/browse/SPARK-23056
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.3, 2.2.2, 2.3.0
            Reporter: Yash Datta


When a URL contains an internationalized domain name, for example:

val url = "http://правительство.рф";
parse_url returns null, whereas Hive's version of parse_url works fine.

Digging further, I found that the difference lies in the following call in Spark:

private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}

whereas Hive uses java.net.URL:

url = new URL(urlStr)

Sure enough, this simple test demonstrates that URL works but URI does not in
this case:

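import java.net.{URI, URL}

val url = "http://правительство.рф"

// URI does not accept the non-ASCII host, so getHost returns null
val uriHost = new URI(url).getHost
// URL does not validate the host and returns it unchanged
val urlHost = new URL(url).getHost

println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф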
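
One possible way to restore the old behaviour is to keep java.net.URI as the fast path and fall back to java.net.URL only when URI yields no host. The sketch below is an illustration under that assumption, not Spark's actual fix; the object name HostFallback and the method extractHost are hypothetical.

```scala
import java.net.{MalformedURLException, URI, URISyntaxException, URL}

// Hypothetical helper, not part of Spark: try URI first, then fall back to URL.
object HostFallback {
  def extractHost(raw: String): Option[String] = {
    // Fast path: java.net.URI. getHost is null for hosts it cannot
    // treat as a server-based authority, such as non-ASCII domain names.
    val fromUri =
      try Option(new URI(raw).getHost)
      catch { case _: URISyntaxException => None }

    // Fallback: java.net.URL, which does not validate the host.
    // Note: the URL(String) constructor is deprecated in recent JDKs.
    fromUri.orElse {
      try Option(new URL(raw).getHost).filter(_.nonEmpty)
      catch { case _: MalformedURLException => None }
    }
  }
}
```

With this helper, HostFallback.extractHost("http://правительство.рф") would take the fallback path, while plain ASCII hosts never reach java.net.URL, so the performance benefit of URI is kept for the common case.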

To reproduce the problem on spark-sql:

spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
returns NULL

This problem was introduced by
<https://issues.apache.org/jira/browse/SPARK-16826>, which was intended to
improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc";]', 'QUERY', 'p');

-- returns NULL in Spark 2.1+
-- returns ["abc"] in versions before Spark 2.1
```
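
The query-string case can be checked outside Spark in the same way as the host case. The following is a minimal sketch, assuming the cause is that java.net.URI rejects the double quote character in the query while java.net.URL does not validate it; the object name QueryCheck is hypothetical.

```scala
import java.net.{URI, URL}
import scala.util.Try

// Hypothetical standalone check, not part of Spark.
object QueryCheck {
  val raw = "http://stanzhai.site?p=[\"abc\";]"

  // java.net.URI validates every character, so construction fails here
  // and Spark's getUrl would catch the exception and return null.
  val uriParsed: Boolean = Try(new URI(raw)).isSuccess

  // java.net.URL only splits the string into components without validating them.
  val urlQuery: String = new URL(raw).getQuery

  def main(args: Array[String]): Unit = {
    println(s"uriParsed = $uriParsed")
    println(s"urlQuery = $urlQuery")
  }
}
```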


