Brady Tello created SPARK-37111:
-----------------------------------

             Summary: RDD file loading APIs throw URISyntaxException when there is a colon in the file path
                 Key: SPARK-37111
                 URL: https://issues.apache.org/jira/browse/SPARK-37111
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.0
            Reporter: Brady Tello


When a colon is present in a file path, many of Spark's RDD file loading 
APIs (textFile, wholeTextFiles, possibly others) throw a URISyntaxException.  
The following code and stack trace were generated on my laptop running 
Spark 3.2.0.  
{code:java}
scala> val df = sc.wholeTextFiles("/Users/brady.tello/test:me/test.txt").take(1)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at org.apache.hadoop.fs.Path.initialize(Path.java:259)
  at org.apache.hadoop.fs.Path.<init>(Path.java:217)
  at org.apache.hadoop.fs.Path.<init>(Path.java:125)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
  at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
  ... 47 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at java.base/java.net.URI.checkPath(URI.java:1990)
  at java.base/java.net.URI.<init>(URI.java:780)
  at org.apache.hadoop.fs.Path.initialize(Path.java:256)
  ... 68 more
{code}
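The innermost frames point at java.net.URI, so the same message can likely be 
reproduced without Spark or Hadoop on the classpath. The sketch below is my 
reading of the trace rather than a copy of Hadoop's code: it assumes the glob 
component "test:me" is split at the first colon into a scheme ("test") and a 
path ("me") before being handed to the multi-argument URI constructor. 
ColonPathDemo, schemeOf and tryBuild are hypothetical names used only for 
illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonPathDemo {

    // Hypothetical helper, NOT a Hadoop API. Assumption: text before the
    // first ':' is taken as a URI scheme when no '/' comes before it.
    public static String schemeOf(String component) {
        int colon = component.indexOf(':');
        int slash = component.indexOf('/');
        if (colon != -1 && (slash == -1 || colon < slash)) {
            return component.substring(0, colon);
        }
        return null;
    }

    // Builds a URI from the split pieces and returns either the resulting
    // URI string or the message of the URISyntaxException that was thrown.
    public static String tryBuild(String component) {
        String scheme = schemeOf(component);
        String path = (scheme == null)
                ? component
                : component.substring(scheme.length() + 1);
        try {
            // A non-null scheme combined with a path that does not start
            // with '/' is rejected by java.net.URI.
            return new URI(scheme, null, path, null, null).toString();
        } catch (URISyntaxException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryBuild("test:me")); // component containing a colon
        System.out.println(tryBuild("testme"));  // same component without it
    }
}
```

If that assumption holds, the failure is in how a single path component is 
parsed rather than in anything specific to wholeTextFiles, which would explain 
why several RDD loading APIs are affected.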
Why not simply avoid colons in paths?  I'm running Spark on top of an S3 
environment in which users are only permitted to read and write data to their 
personal S3 workspace, and the path to that workspace contains a colon.  
Removing the colon would require a major change to the authentication 
architecture of several apps outside of our Spark app, so we don't really have 
the flexibility to remove it.  Without a fix for this bug, users in this 
environment simply cannot use the RDD APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)