[ https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579974#comment-16579974 ]
Ondrej Kokes commented on SPARK-22236:
--------------------------------------

Multiline=true by default would cause some slowdown, but data quality would either increase or stay the same - it would never go down. So that discussion is mostly about performance.

With escape changes, while we would see improvements in data quality on the input side, *people with preexisting datasets exported by Spark would suffer (unexpected) data loss,* because the escaping strategy at read time could differ from the one in effect when the data was written. I think that is the more important aspect to consider.

> CSV I/O: does not respect RFC 4180
> ----------------------------------
>
>                 Key: SPARK-22236
>                 URL: https://issues.apache.org/jira/browse/SPARK-22236
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Ondrej Kokes
>            Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with a backslash by default. However, the appropriate behaviour as set out by RFC 4180 (and adhered to by many software packages) is to escape using a second double quote.
>
> This piece of Python code demonstrates the issue:
> {code}
> import csv
>
> with open('testfile.csv', 'w') as f:
>     cw = csv.writer(f)
>     cw.writerow(['a 2.5" drive', 'another column'])
>     cw.writerow(['a "quoted" string', '"quoted"'])
>     cw.writerow([1, 2])
>
> with open('testfile.csv') as f:
>     print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
>
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
>
> # explicitly setting the escape character fixes the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
>
> The same applies to writes: reading a file written by Spark with a standard CSV parser may result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv')  # reads the file correctly
> df.write.format("csv").save('testout.csv')
>
> with open('testout.csv/part-....csv') as f:
>     cr = csv.reader(f)
>     print(next(cr))
>     print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
>
> The culprit is in [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91], where the default escape character is overridden.
>
> While it's possible to work with CSV files in a "compatible" manner, it would be useful if Spark had sensible defaults that conform to the above-mentioned RFC (as well as W3C recommendations). I realise this would be a breaking change and thus, if accepted, it would probably need to result in a warning first, before moving to a new default.
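To make the comment's first point concrete: multiline=true refers to the CSV reader option that lets quoted fields span line breaks. With it enabled, Spark can no longer split input files on newlines for parallel reading, which is where the slowdown comes from. A minimal sketch of its use (assuming the multiLine reader option available since Spark 2.2; the file name is illustrative):

{code}
# A record whose quoted field contains a newline, e.g.
#   "line one
#   line two",second column
# is only parsed as a single row when multiLine is enabled; the default
# line-by-line reader would split it into two malformed rows.
spark.read.option('multiLine', True).csv('multiline.csv').collect()
{code}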
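And to make the backwards-compatibility point concrete: the safe pattern for now is to pin the escape character explicitly on both ends of a round trip, so readability does not depend on whichever default a given Spark version ships. A minimal sketch, assuming the same escape option used in the reader examples quoted above (the CSV writer accepts it as well; the output path is illustrative):

{code}
# Write RFC 4180-style output: embedded double quotes are escaped by
# doubling them instead of Spark's backslash default.
df.write.option('escape', '"').csv('testout_rfc.csv')

# Read it back with the same explicit setting; data written this way stays
# intact even if Spark's default escape character changes in the future.
spark.read.option('escape', '"').csv('testout_rfc.csv').collect()
{code}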