[ 
https://issues.apache.org/jira/browse/SPARK-37111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434112#comment-17434112
 ] 

Hyukjin Kwon commented on SPARK-37111:
--------------------------------------

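I believe this comes from a limitation in Hadoop's path handling (the exception is raised in org.apache.hadoop.fs.Path) rather than from Spark itself.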
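
To make the failure mode concrete, here is a minimal JDK-only sketch of how a path component such as "test:me" can end up rejected by java.net.URI. The class name ColonPathDemo and the helpers toUri and describe are hypothetical, and the scheme-splitting rule (text before a colon counts as a scheme when no slash precedes it) is my assumption about what Hadoop's Path(String) does, not a copy of its code. The quoted trace does show the glob code constructing a Path from "test:me" alone, which is consistent with this reading.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonPathDemo {

    // Hypothetical helper, NOT Hadoop code: assumes a Path-like parser that
    // treats the text before a colon as a URI scheme when no '/' precedes it.
    public static URI toUri(String component) throws URISyntaxException {
        String scheme = null;
        String path = component;
        int colon = component.indexOf(':');
        int slash = component.indexOf('/');
        if (colon != -1 && (slash == -1 || colon < slash)) {
            scheme = component.substring(0, colon);
            path = component.substring(colon + 1);
        }
        // java.net.URI rejects a non-empty path that does not start with '/'
        // whenever a scheme is present.
        return new URI(scheme, null, path, null, null);
    }

    // Returns a short description instead of throwing, so both cases can be printed.
    public static String describe(String component) {
        try {
            return "ok: " + toUri(component);
        } catch (URISyntaxException e) {
            return "error: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        // A single component containing a colon, as the glob code would see it.
        System.out.println(describe("test:me"));
        // An absolute path without a colon parses normally.
        System.out.println(describe("/Users/brady.tello/test.txt"));
    }
}
```

Under these assumptions the first call reports "Relative path in absolute URI", the same reason shown in the quoted trace, because "test" is taken as the scheme and "me" as a relative path. The second call succeeds because no scheme is extracted.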

> RDD file loading APIs throw URISyntaxException when there is a colon in the 
> file path
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-37111
>                 URL: https://issues.apache.org/jira/browse/SPARK-37111
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Brady Tello
>            Priority: Major
>
> When a colon is present in a path to a file, many of Spark's RDD file loading 
> APIs (textFile, wholeTextFiles, possibly others) throw a URISyntaxException.  
> The following Scala code and stack trace were generated on my laptop 
> running Spark 3.2.0.  I've verified that this issue also affects Python and 
> SQL, and I assume it also affects Java.
> {code:java}
> scala> val df = sc.wholeTextFiles("/Users/brady.tello/test:me/test.txt").take(1)
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
>   at org.apache.hadoop.fs.Path.initialize(Path.java:259)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:217)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:125)
>   at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
>   at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
>   at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
>   ... 47 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
>   at java.base/java.net.URI.checkPath(URI.java:1990)
>   at java.base/java.net.URI.<init>(URI.java:780)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:256)
>   ... 68 more
> {code}
> Why not simply avoid colons in my paths?  I'm running Spark on top of an S3 
> environment in which users are only permitted to read and write data in 
> their personal S3 workspace, and the path to that workspace contains a 
> colon.  Removing the colon would require a major change to the 
> authentication architecture shared by several apps outside of our Spark 
> app, so we don't have the flexibility to remove it.  Without a fix for this 
> bug, users simply cannot use the RDD APIs.


