Brady Tello created SPARK-37111:
-----------------------------------

             Summary: RDD file loading APIs throw URISyntaxException when there is a colon in the file path
                 Key: SPARK-37111
                 URL: https://issues.apache.org/jira/browse/SPARK-37111
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.0
            Reporter: Brady Tello

When a colon is present in the path to a file, many of Spark's RDD file loading APIs (textFile, wholeTextFiles, possibly others) throw a URISyntaxException. The following code and stack trace were generated on my laptop running Spark 3.2.0.

{code:java}
scala> val df = sc.wholeTextFiles("/Users/brady.tello/test:me/test.txt").take(1)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at org.apache.hadoop.fs.Path.initialize(Path.java:259)
  at org.apache.hadoop.fs.Path.<init>(Path.java:217)
  at org.apache.hadoop.fs.Path.<init>(Path.java:125)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
  at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
  ... 47 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at java.base/java.net.URI.checkPath(URI.java:1990)
  at java.base/java.net.URI.<init>(URI.java:780)
  at org.apache.hadoop.fs.Path.initialize(Path.java:256)
  ... 68 more
{code}

Why not simply avoid colons in paths, you ask? I'm running Spark on top of an S3 environment in which users are only permitted to read and write data in their personal S3 workspace, and the path to each personal workspace contains a colon. Removing that colon would require a major change to the authentication architecture shared by several apps outside of our Spark app, so we don't really have the flexibility to remove it. Without a fix for this bug, these users simply cannot use the RDD APIs.
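
The trace suggests that the failure occurs when the glob logic builds a Hadoop Path from the single component test:me rather than from the full path. The JDK-only sketch below illustrates that mechanism. It assumes, as a simplification of Hadoop's Path(String) parsing and not a copy of it, that a colon appearing before any slash is treated as a scheme separator. The class name ColonPathDemo and the method tryBuild are hypothetical and exist only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class ColonPathDemo {
    // Hypothetical helper: splits a string into scheme and path the way a
    // scheme-aware path parser plausibly would (assumption, simplified),
    // then hands the pieces to java.net.URI.
    static String tryBuild(String s) {
        int colon = s.indexOf(':');
        int slash = s.indexOf('/');
        String scheme = null;
        String path = s;
        // A colon with no slash before it is read as "scheme:rest".
        if (colon != -1 && (slash == -1 || colon < slash)) {
            scheme = s.substring(0, colon);
            path = s.substring(colon + 1);
        }
        try {
            // java.net.URI rejects a non-empty path that does not start
            // with '/' whenever a scheme is present.
            return new URI(scheme, null, path, null, null).toString();
        } catch (URISyntaxException e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The full absolute path: the first slash precedes the colon.
        System.out.println(tryBuild("/Users/brady.tello/test:me/test.txt"));
        // The isolated component: "test" is taken as the scheme.
        System.out.println(tryBuild("test:me"));
    }
}
```

In this sketch the full absolute path is accepted, because its first slash comes before the colon, while the isolated component test:me is rejected with the same "Relative path in absolute URI" message that appears in the stack trace.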