[
https://issues.apache.org/jira/browse/SPARK-37111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434112#comment-17434112
]
Hyukjin Kwon commented on SPARK-37111:
--------------------------------------
I believe this comes from a limitation in Hadoop.
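A minimal sketch (mine, not from the reporter) of where the error originates, assuming the standard Hadoop {{Path}} behavior: {{Path}}'s string parser treats everything before the first colon in a path component as a URI scheme, so a segment like "test:me" is handed to {{java.net.URI}} as scheme "test" with relative path "me", and the URI constructor rejects a relative path whenever a scheme is present:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonPathDemo {
    // Mimics what org.apache.hadoop.fs.Path.initialize ends up doing after
    // Path's parser mistakes "test" (everything before the colon) for a URI
    // scheme and "me" for the path: a scheme plus a relative path is illegal,
    // so java.net.URI's checkPath throws.
    static String reproduce() {
        try {
            new URI("test", null, "me", null, null);
            return "no exception";
        } catch (URISyntaxException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(reproduce());
    }
}
```

Since this parsing happens inside org.apache.hadoop.fs.Path (the bottom frames of the stack trace below are all Hadoop and JDK code), it is hard to work around from the Spark side alone.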
> RDD file loading APIs throw URISyntaxException when there is a colon in the
> file path
> -------------------------------------------------------------------------------------
>
> Key: SPARK-37111
> URL: https://issues.apache.org/jira/browse/SPARK-37111
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Brady Tello
> Priority: Major
>
> When a colon is present in a file path, many of Spark's RDD file loading
> APIs (textFile, wholeTextFiles, and possibly others) throw a
> URISyntaxException. The following Scala code and stack trace were generated
> on my laptop running Spark 3.2.0. I've verified that this issue also
> affects Python and SQL, and I assume it probably affects Java as well.
> {code:java}
> scala> val df = sc.wholeTextFiles("/Users/brady.tello/test:me/test.txt").take(1)
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
>   at org.apache.hadoop.fs.Path.initialize(Path.java:259)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:217)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:125)
>   at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:303)
>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
>   at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
>   at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
>   ... 47 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
>   at java.base/java.net.URI.checkPath(URI.java:1990)
>   at java.base/java.net.URI.<init>(URI.java:780)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:256)
>   ... 68 more
> {code}
> Why can't I just not use colons in my paths, you ask? I'm running Spark on
> top of an S3 environment in which users may only read and write data in
> their personal S3 workspace, and the path to that workspace contains a
> colon. Removing the colon would mean a major change to the authentication
> architecture shared by several apps outside our Spark app, so we don't
> really have the flexibility to remove it. Without a fix for this bug, those
> users simply cannot use the RDD APIs.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]