Yash Datta created SPARK-23056:
----------------------------------

             Summary: parse_url regression when switched to using java.net.URI instead of java.net.URL
                 Key: SPARK-23056
                 URL: https://issues.apache.org/jira/browse/SPARK-23056
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.3, 2.2.2, 2.3.0
            Reporter: Yash Datta

When a URL contains an internationalized domain name, for example:

    val url = "http://правительство.рф"

parse_url returns null, whereas Hive's version of parse_url works fine.

On digging further, I found that the difference is in the following call in Spark:

    private def getUrl(url: UTF8String): URI = {
      try {
        new URI(url.toString)
      } catch {
        case e: URISyntaxException => null
      }
    }

while Hive uses java.net.URL:

    url = new URL(urlStr)

Sure enough, this simple test demonstrates that URL works but URI does not in this case:

    import java.net.{URI, URL}

    val url = "http://правительство.рф"
    val uriHost = new URI(url).getHost
    val urlHost = new URL(url).getHost
    println(s"uriHost = $uriHost") // prints uriHost = null
    println(s"urlHost = $urlHost") // prints urlHost = правительство.рф

To reproduce the problem in spark-sql:

    spark-sql> select parse_url('http://千夏ともか.test', 'HOST');

This returns NULL.

The problem was introduced by <https://issues.apache.org/jira/browse/SPARK-16826>, which was designed to improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
-- returns NULL in Spark 2.1+
-- returns ["abc"] in versions before Spark 2.1
```
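
To make the difference between the two parsers concrete, here is a minimal sketch of a host lookup that tries java.net.URI first and falls back to java.net.URL when URI yields no host or rejects the string. This is only an illustration of the behaviour described above, not Spark's code and not a proposed patch; the names `ParseUrlHostSketch` and `hostOf` are hypothetical. Note also that the `new URL(String)` constructor is deprecated in recent JDKs, so a real fix might need a different approach.

```scala
import java.net.{MalformedURLException, URI, URISyntaxException, URL}

// Hypothetical helper for illustration only; not part of Spark.
object ParseUrlHostSketch {

  // Try the strict java.net.URI parser first, then fall back to the
  // more lenient java.net.URL parser (the class Hive is reported to use).
  def hostOf(url: String): Option[String] = {
    val fromUri: Option[String] =
      try {
        // getHost is null when URI cannot parse the authority as a server host.
        Option(new URI(url).getHost)
      } catch {
        case _: URISyntaxException => None
      }

    fromUri.orElse {
      try {
        // URL.getHost may return an empty string; treat that as "no host".
        Option(new URL(url).getHost).filter(_.nonEmpty)
      } catch {
        case _: MalformedURLException => None
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // The internationalized domain from this issue.
    println(hostOf("http://правительство.рф"))
    // A plain ASCII host, which URI handles on its own.
    println(hostOf("http://spark.apache.org/docs"))
    // Not a URL at all: both parsers reject it.
    println(hostOf("not a url"))
  }
}
```

The fallback keeps the faster URI path for the common case and only pays for a second parse when URI gives up, which is one way to preserve the intent of SPARK-16826 while restoring the older results.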