Ondrej Kokes created SPARK-22236:
------------------------------------

             Summary: CSV I/O: does not respect RFC 4180
                 Key: SPARK-22236
                 URL: https://issues.apache.org/jira/browse/SPARK-22236
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.2.0
            Reporter: Ondrej Kokes
            Priority: Minor

When reading or writing CSV files with Spark, double quotes are escaped with a backslash by default. However, the appropriate behaviour as set out by RFC 4180 (and adhered to by many software packages) is to escape using a second double quote.

This piece of Python code demonstrates the issue:

{code}
import csv

with open('testfile.csv', 'w') as f:
    cw = csv.writer(f)
    cw.writerow(['a 2.5" drive', 'another column'])
    cw.writerow(['a "quoted" string', '"quoted"'])
    cw.writerow([1,2])

with open('testfile.csv') as f:
    print(f.read())

# "a 2.5"" drive",another column
# "a ""quoted"" string","""quoted"""
# 1,2

spark.read.csv('testfile.csv').collect()

# [Row(_c0='"a 2.5"" drive"', _c1='another column'),
#  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
#  Row(_c0='1', _c1='2')]

# explicitly stating the escape character fixed the issue
spark.read.option('escape', '"').csv('testfile.csv').collect()

# [Row(_c0='a 2.5" drive', _c1='another column'),
#  Row(_c0='a "quoted" string', _c1='"quoted"'),
#  Row(_c0='1', _c1='2')]
{code}

The same applies to writes, where reading the file written by Spark may result in garbage.

{code}
df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file correctly
df.write.format("csv").save('testout.csv')

with open('testout.csv/part-....csv') as f:
    cr = csv.reader(f)
    print(next(cr))
    print(next(cr))

# ['a 2.5\\ drive"', 'another column']
# ['a \\quoted\\" string"', '\\quoted\\""']
{code}

While it's possible to work with CSV files in a "compatible" manner, it would be useful if Spark had sensible defaults that conform to the above-mentioned RFC (as well as W3C recommendations). I realise this would be a breaking change and thus if accepted, it would probably need to result in a warning first, before moving to a new default.

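The mismatch between the two escaping conventions can also be reproduced without Spark. The following is a minimal sketch using only Python's standard csv module. The helper names dump and load and the backslash dialect settings are illustrative assumptions, not part of Spark or of the original report, and the backslash dialect only approximates what Spark 2.2.0 produces by default.

```python
import csv
import io

rows = [['a 2.5" drive', 'another column'],
        ['a "quoted" string', '"quoted"']]


def dump(data, **fmt):
    """Serialise rows to CSV text using the given csv formatting options (hypothetical helper)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n', **fmt).writerows(data)
    return buf.getvalue()


def load(text, **fmt):
    """Parse CSV text back into a list of rows using the given options (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text), **fmt))


# RFC 4180 convention: an embedded quote is written as two quotes.
# This is the csv module's default (doublequote=True).
rfc_text = dump(rows)

# Backslash convention: an embedded quote is preceded by an escape character.
# Assumed here as a stand-in for the Spark default described in the issue.
backslash = dict(doublequote=False, escapechar='\\')
bs_text = dump(rows, **backslash)

print(rfc_text)
print(bs_text)

# Each convention round-trips with itself...
print(load(rfc_text) == rows)
print(load(bs_text, **backslash) == rows)

# ...but mixing writer and reader conventions does not recover the input.
print(load(rfc_text, **backslash) == rows)
print(load(bs_text) == rows)
```

Either convention is lossless on its own; the data is only corrupted when the writer and the reader disagree, which is why the default matters for interoperability with other tools.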