[jira] [Created] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path

Brady Tello (Jira) Sun, 28 Mar 2021 12:41:04 -0700

Brady Tello created SPARK-34883:
-----------------------------------

             Summary: Setting CSV reader option "multiLine" to "true" causes 
URISyntaxException when colon is in file path
                 Key: SPARK-34883
                 URL: https://issues.apache.org/jira/browse/SPARK-34883
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.1, 3.0.0
            Reporter: Brady Tello



Setting the CSV reader's "multiLine" option to "True" throws the following 
exception when a ':' character is in the file path.

 
{code:java}
java.net.URISyntaxException: Relative path in absolute URI: test:dir
{code}
I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error 
whether I use Scala, Python, or SQL.

The following code works fine:

 
{code:python}
csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
tempDF = spark.read.option("sep", "\t").csv(csvFile)
{code}
While the following code fails:

 
{code:python}
csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
tempDF = spark.read.option("sep", "\t").option("multiLine", "True").csv(csvFile)
{code}
Full Stack Trace from Python:

 
{code:java}
---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<command-8965899> in <module>
      3 #csvFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"
      4 
----> 5 tempDF = (spark.read           # The DataFrameReader
      6    .option("sep", "\t")        # Use tab delimiter (default is comma-separator)
      7    .option("multiLine", "True")

/databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)
    735             path = [path]
    736         if type(path) == list:
--> 737             return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    738         elif isinstance(path, RDD):
    739             def func(iterator):

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    114                 # Hide where the exception came from that shows a non-Pythonic
    115                 # JVM exception message.
--> 116                 raise converted from None
    117             else:
    118                 raise

IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:dir
{code}
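
The exception message names only the directory segment "test:dir", which suggests that segment is being parsed by itself as a URI, with "test" taken as the scheme and "dir" as a relative path. The sketch below is only a rough illustration of that ambiguity. It uses Python's urllib.parse, not the java.net.URI or Hadoop code that Spark actually runs, so the parsing rules are an approximation, and segments_parsed_as_scheme is a hypothetical helper written for this example.

```python
from urllib.parse import urlsplit


def segments_parsed_as_scheme(path):
    """Return (segment, scheme, rest) for each path segment that,
    taken on its own, parses as '<scheme>:<rest>'.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    flagged = []
    for segment in path.split("/"):
        parts = urlsplit(segment)
        if parts.scheme:
            flagged.append((segment, parts.scheme, parts.path))
    return flagged


csv_file = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"

# The full path begins with "/", so no scheme is detected for it.
print(repr(urlsplit(csv_file).scheme))

# The bare directory name is split at the colon into a scheme and a
# relative path, which mirrors the "test:dir" in the exception message.
print(segments_parsed_as_scheme(csv_file))
```

If something similar happens inside the multiLine code path, it would explain why the same path string works when it is handled as a whole but fails once a single segment is turned into a URI. This is an assumption drawn from the error text, not a confirmed root cause.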
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
