Brady Tello created SPARK-37111:
-----------------------------------
Summary: RDD file loading APIs throw URISyntaxException when there
is a colon in the file path
Key: SPARK-37111
URL: https://issues.apache.org/jira/browse/SPARK-37111
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.2.0
Reporter: Brady Tello
When a colon is present in a path to a file, many of Spark's RDD file loading
APIs (textFile, wholeTextFiles, possibly others) throw a URISyntaxException.
The following code and stack trace were generated on my laptop running
Spark 3.2.0.
{code:java}
scala> val df = sc.wholeTextFiles("/Users/brady.tello/test:me/test.txt").take(1)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at org.apache.hadoop.fs.Path.initialize(Path.java:259)
  at org.apache.hadoop.fs.Path.<init>(Path.java:217)
  at org.apache.hadoop.fs.Path.<init>(Path.java:125)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
  at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
  ... 47 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at java.base/java.net.URI.checkPath(URI.java:1990)
  at java.base/java.net.URI.<init>(URI.java:780)
  at org.apache.hadoop.fs.Path.initialize(Path.java:256)
  ... 68 more
{code}
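The "Caused by" frames point at java.net.URI rather than at Spark itself. The sketch below tries to reproduce just that step in plain Java, without Spark or Hadoop on the classpath. It rests on an assumption I have not verified against the Hadoop source: that when the glob code builds a Path from the single component "test:me", the text before the colon is taken as a URI scheme ("test") and the rest as the path ("me"). ColonPathDemo and reasonFor are hypothetical names used only for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonPathDemo {

    // Hypothetical helper: builds a URI from a scheme and a path with the
    // five-argument java.net.URI constructor and returns the rejection
    // reason, or null if the URI was accepted.
    public static String reasonFor(String scheme, String path) {
        try {
            new URI(scheme, null, path, null, null);
            return null;
        } catch (URISyntaxException e) {
            return e.getReason();
        }
    }

    public static void main(String[] args) {
        // Assumption: the component "test:me" is split at the colon into
        // scheme "test" and path "me". A URI with a scheme must have an
        // empty path or one starting with "/", so this is rejected.
        System.out.println(reasonFor("test", "me"));

        // The same colon inside a path that starts with "/" is accepted
        // by java.net.URI, so the colon alone is not illegal in a URI.
        System.out.println(reasonFor("file", "/Users/brady.tello/test:me/test.txt"));
    }
}
```

If that assumption holds, the colon is only a problem when a path component is parsed on its own, which would be consistent with the failure appearing under Globber.doGlob rather than when the full path string is first passed in.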
Why can't I just avoid colons in my paths, you ask? I'm running Spark on top
of an S3 environment in which users are only permitted to read and write data
to their personal S3 workspace, and the path to that workspace contains a
colon. Removing the colon would require a major change to the authentication
architecture of several apps outside of our Spark app, so we don't really have
the flexibility to remove it. Without a fix for this bug, our users simply
cannot use the RDD APIs.