[ https://issues.apache.org/jira/browse/SPARK-23056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yash Datta updated SPARK-23056:
-------------------------------
    Description:
When using internationalized domain names in URLs, such as:
{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null in Spark, but works correctly with Hive's version of parse_url.

Digging further, the difference is in the following call in Spark:
{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}
while Hive uses java.net.URL:
{code:java}
url = new URL(urlStr)
{code}
This simple test demonstrates that URL works but URI does not in this case:
{code:java}
val url = "http://правительство.рф"
val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost
println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}
To reproduce the problem in spark-sql:
{code:java}
spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
{code}
This returns NULL.

This problem was introduced by <https://issues.apache.org/jira/browse/SPARK-16826>, which was designed to improve the performance of PARSE_URL().
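For reference, the URI/URL difference described above can be reproduced outside Spark with plain Java. The sketch below is illustrative only: the class name IdnHostDemo and its helper methods are hypothetical and not part of Spark or Hive. The last helper shows one possible workaround (converting the host to its ASCII form with java.net.IDN before building the URI); that is an assumption, not the fix adopted for this issue.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical demo class; not part of Spark or Hive.
class IdnHostDemo {

    // Host as reported by java.net.URI. Returns null when the string cannot
    // be parsed, or when the authority is not a valid server-based host.
    static String hostViaUri(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Host as reported by java.net.URL, which does not validate the
    // characters used in the host name.
    static String hostViaUrl(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    // Assumed workaround: split the string with java.net.URL, convert the
    // host to its ASCII (Punycode) form, then rebuild a java.net.URI.
    static String hostViaUriWithIdn(String url) {
        try {
            URL parsed = new URL(url);
            String asciiHost = IDN.toASCII(parsed.getHost());
            URI rebuilt = new URI(parsed.getProtocol(), null, asciiHost,
                    parsed.getPort(), parsed.getPath(), parsed.getQuery(),
                    parsed.getRef());
            return rebuilt.getHost();
        } catch (MalformedURLException | URISyntaxException
                | IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String url = "http://правительство.рф";
        System.out.println("URI host          = " + hostViaUri(url));
        System.out.println("URL host          = " + hostViaUrl(url));
        System.out.println("URI host with IDN = " + hostViaUriWithIdn(url));
    }
}
```

With this approach the host comes back in its ASCII "xn--" form rather than in the original Unicode form. Whether parse_url should return one or the other is a separate design question that this sketch does not settle.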
The same issue exists in the following SQL:
{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}
This returns null in Spark 2.1+, but returned ["abc"] in versions before Spark 2.1.

  was:
When using internationalized domain names in URLs, such as:
{code:java}
val url = "http://правительство.рф"
{code}
parse_url returns null in Spark, but works correctly with Hive's version of parse_url.

Digging further, the difference is in the following call in Spark:
{code:java}
private def getUrl(url: UTF8String): URI = {
  try {
    new URI(url.toString)
  } catch {
    case e: URISyntaxException => null
  }
}
{code}
while Hive uses java.net.URL:
{code:java}
url = new URL(urlStr)
{code}
This simple test demonstrates that URL works but URI does not in this case:
{code:java}
val url = "http://правительство.рф"
val uriHost = new URI(url).getHost
val urlHost = new URL(url).getHost
println(s"uriHost = $uriHost") // prints uriHost = null
println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
{code}
To reproduce the problem in spark-sql:
{code:java}
spark-sql> select parse_url('http://千夏ともか.test', 'HOST');
{code}
This returns NULL.

This problem was introduced by <https://issues.apache.org/jira/browse/SPARK-16826>, which was designed to improve the performance of PARSE_URL().
The same issue exists in the following SQL:
{code:java}
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
{code}
This returns null in Spark 2.1+, but returned ["abc"] in versions before Spark 2.1.


> parse_url regression when switched to using java.net.URI instead of java.net.URL
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-23056
>                 URL: https://issues.apache.org/jira/browse/SPARK-23056
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.3, 2.2.2, 2.3.0
>            Reporter: Yash Datta
>              Labels: regression
>
> When using internationalized domain names in URLs, such as:
> {code:java}
> val url = "http://правительство.рф"
> {code}
> parse_url returns null in Spark, but works correctly with Hive's version of parse_url.
>
> Digging further, the difference is in the following call in Spark:
> {code:java}
> private def getUrl(url: UTF8String): URI = {
>   try {
>     new URI(url.toString)
>   } catch {
>     case e: URISyntaxException => null
>   }
> }
> {code}
> while Hive uses java.net.URL:
> {code:java}
> url = new URL(urlStr)
> {code}
> This simple test demonstrates that URL works but URI does not in this case:
> {code:java}
> val url = "http://правительство.рф"
> val uriHost = new URI(url).getHost
> val urlHost = new URL(url).getHost
> println(s"uriHost = $uriHost") // prints uriHost = null
> println(s"urlHost = $urlHost") // prints urlHost = правительство.рф
> {code}
> To reproduce the problem in spark-sql:
> {code:java}
> spark-sql> select parse_url('http://日本語.JP/case/accessible/', 'HOST');
> {code}
> This returns NULL.
>
> This problem was introduced by <https://issues.apache.org/jira/browse/SPARK-16826>, which was designed to improve the performance of PARSE_URL().
> The same issue exists in the following SQL:
> {code:java}
> SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')
> {code}
> This returns null in Spark 2.1+, but returned ["abc"] in versions before Spark 2.1.