[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path

Brady Tello (Jira) Mon, 25 Oct 2021 07:25:08 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433793#comment-17433793
 ]


Brady Tello commented on SPARK-34883:
-------------------------------------

This issue affects more APIs than the CSV multiLine reader.  In Spark 3.2.0, 
the RDD API cannot read a text file whose path contains a colon.  The 
following stack trace was generated on my laptop using Spark 3.2.0-SNAPSHOT.

 
{code:java}
scala> val df = sc.textFile("/Users/myUserName/test:me/test.txt").take(1)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at org.apache.hadoop.fs.Path.initialize(Path.java:259)
  at org.apache.hadoop.fs.Path.<init>(Path.java:217)
  at org.apache.hadoop.fs.Path.<init>(Path.java:125)
  at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:269)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
  ... 47 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: test:me
  at java.base/java.net.URI.checkPath(URI.java:1990)
  at java.base/java.net.URI.<init>(URI.java:780)
  at org.apache.hadoop.fs.Path.initialize(Path.java:256)
  ... 68 more
{code}
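
The innermost frames of the trace (URI.checkPath called from URI.<init>) 
can be reproduced without Spark or Hadoop.  The sketch below is my 
reading of the trace, not the Hadoop source: it assumes that the path 
component "test:me" is split at the colon into a scheme ("test") and a 
relative path ("me") before the multi-argument java.net.URI constructor 
is called.  The class name ColonPathDemo and the helper describeFailure 
are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class ColonPathDemo {

    /**
     * Builds a URI from a scheme and a path using the five-argument
     * constructor. Returns the exception message if java.net.URI rejects
     * the combination, or null if the URI is accepted.
     */
    public static String describeFailure(String scheme, String path) {
        try {
            new URI(scheme, null, path, null, null);
            return null;
        } catch (URISyntaxException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed split of the component "test:me": scheme "test", path "me".
        // A scheme combined with a path that does not start with '/' is rejected.
        System.out.println(describeFailure("test", "me"));

        // The same colon inside an absolute path with an explicit scheme is accepted.
        System.out.println(describeFailure("file", "/Users/myUserName/test:me/test.txt"));
    }
}
```

The first call yields the same message as the "Caused by" line above, 
while the second returns null.  That suggests the colon is only a 
problem when a single path component is parsed on its own, which would 
be consistent with the Globber frames in the trace.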

> Setting CSV reader option "multiLine" to "true" causes URISyntaxException 
> when colon is in file path
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34883
>                 URL: https://issues.apache.org/jira/browse/SPARK-34883
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 3.1.1
>            Reporter: Brady Tello
>            Priority: Major
>
> Setting the CSV reader's "multiLine" option to "True" throws the following 
> exception when a ':' character is in the file path.
>  
> {code:java}
> java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}
> I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error 
> whether I use Scala, Python, or SQL.
> The following code works fine:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = (spark.read.option("sep", "\t").csv(csvFile))
> {code}
> While the following code fails:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = (spark.read.option("sep", "\t").option("multiLine", "True").csv(csvFile))
> {code}
> Full Stack Trace from Python:
>  
> {code:java}
> ---------------------------------------------------------------------------
> IllegalArgumentException                  Traceback (most recent call last)
> <command-8965899> in <module>
>       3 csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
>       4
> ----> 5 tempDF = (spark.read.option("sep", "\t").option("multiLine", "True")
>
> /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema,
>     sep, encoding, quote, escape, comment, header, inferSchema,
>     ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue,
>     positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns,
>     maxCharsPerColumn, maxMalformedLogPerPartition, mode,
>     columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping,
>     samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter,
>     recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)
>     735             path = [path]
>     736         if type(path) == list:
> --> 737             return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>     738         elif isinstance(path, RDD):
>     739             def func(iterator):
>
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1302
>    1303         answer = self.gateway_client.send_command(command)
> -> 1304         return_value = get_return_value(
>    1305             answer, self.gateway_client, self.target_id, self.name)
>    1306
>
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>     114                 # Hide where the exception came from that shows a non-Pythonic
>     115                 # JVM exception message.
> --> 116                 raise converted from None
>     117             else:
>     118                 raise
>
> IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
