Laszlo Torok created SPARK-34050:
------------------------------------
Summary: Parquet 2 CSV conversion wrong quoting
Key: SPARK-34050
URL: https://issues.apache.org/jira/browse/SPARK-34050
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.0
Reporter: Laszlo Torok
Hi Experts,
I work for GE Corporate. We have a Backup+Restore+extras project with AWS.
I ran into incompatibility issues when trying to convert Parquet files back to
CSV.
Our original sources (Greenplum first) cannot process the converted files
because of improper quoting.
We work with several kinds of ERPs and technical databases, and the data
contains:
# multiline (CR, CRLF, LF) text fields
# mixed quoting inside fields, or just a single double quote in a text field
# text fields where empty-string and NULL values can both occur and have
different meanings
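To make the three cases above concrete, here is a minimal Python sketch of one possible target encoding, independent of Spark. It assumes RFC 4180-style doubling of embedded quotes, the vertical-tab separator and the \N NULL marker that appear in the write options later in this report, and "" for an empty string. The real Greenplum load settings may differ, and encode_field and encode_row are hypothetical names used only for illustration.

```python
SEP = "\x0b"   # vertical tab, i.e. the '\013' separator from the report
NULL = "\\N"   # literal backslash + N, the NULL marker from the report


def encode_field(value, sep=SEP):
    """Encode one field so that NULL and empty string stay distinguishable.

    Assumption: embedded quotes are doubled (RFC 4180 style). This is not
    confirmed as what the Greenplum reader expects.
    """
    if value is None:
        return NULL                      # NULL -> \N, unquoted
    if value == "":
        return '""'                      # empty string -> quoted empty field
    needs_quotes = (
        any(ch in value for ch in ('"', sep, "\r", "\n"))
        or value == NULL                 # a literal "\N" string must not look like NULL
    )
    if needs_quotes:
        return '"' + value.replace('"', '""') + '"'
    return value


def encode_row(row, sep=SEP):
    """Join the encoded fields of one row with the separator."""
    return sep.join(encode_field(v, sep) for v in row)


if __name__ == "__main__":
    row = ["plain", "", None, 'one " quote', "line1\r\nline2"]
    print(repr(encode_row(row)))
```

The point of the sketch is only the distinction between the three representations: an unquoted \N for NULL, a quoted empty field for an empty string, and quoting for any field that contains a quote, the separator, or a line break.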
Our last option combination is:
df.write.format("com.databricks.spark.csv").options(
    header='false', sep='\013', multiLine='true', escapeQuotes='true',
    quote='"', nullValue='\\N', encoding='UTF-8'
).option("quoteAll", 'false').option("compression", "gzip").mode('overwrite').save(s3_csv)
If I do not use escapeQuotes='true', Spark does not quote fields that contain
mixed quoting or a single double quote.
If I do use it, the double quotes written for an empty string are escaped as
well: [sep]\"\"[sep]. Our Greenplum reader cannot read this format for empty
strings during restoration.
It should be [sep]""[sep] or [sep][sep].
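One combination that may be worth trying is sketched below. It is untested against Spark 2.4.0 and this data, so treat it as an assumption rather than a confirmed fix. It keeps the options from the call above and adds only emptyValue, a Spark CSV writer option, set to the empty string with the aim of writing an empty string as [sep][sep] while NULL is still written as \N. The helper names csv_write_options and write_csv are hypothetical.

```python
def csv_write_options(sep="\013"):
    """Return the report's CSV writer options plus an explicit emptyValue.

    Everything except "emptyValue" is copied from the original call.
    """
    return {
        "header": "false",
        "sep": sep,                # '\013' is the vertical-tab character
        "multiLine": "true",       # kept from the original call
        "escapeQuotes": "true",
        "quote": '"',
        "nullValue": "\\N",        # NULL is written as \N
        "encoding": "UTF-8",
        "quoteAll": "false",
        "compression": "gzip",
        # Assumption: an empty emptyValue makes empty strings come out as
        # [sep][sep] instead of [sep]\"\"[sep]. Not verified on 2.4.0.
        "emptyValue": "",
    }


def write_csv(df, path):
    """Write a Spark DataFrame with the options above (df is a pyspark DataFrame)."""
    (
        df.write.format("com.databricks.spark.csv")
        .options(**csv_write_options())
        .mode("overwrite")
        .save(path)
    )
```

Calling write_csv(df, s3_csv) would stand in for the original call. Whether the Greenplum reader then accepts [sep][sep] as an empty string rather than NULL depends on its load settings.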
Can you help our project find the proper quote and escape combination for data
that looks like this:
"2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA\" S.A. | 13-APR-20
"
"290208407
| INT. RIEL DIN 2X32A 230/400V "
""
Thank you in advance!
Regards,
Laszlo Torok