[jira] [Created] (SPARK-34050) Parquet 2 CSV conversion wrong quoting

Laszlo Torok (Jira) Fri, 08 Jan 2021 04:39:08 -0800

Laszlo Torok created SPARK-34050:
------------------------------------

             Summary: Parquet 2 CSV conversion wrong quoting 
                 Key: SPARK-34050
                 URL: https://issues.apache.org/jira/browse/SPARK-34050
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.0
            Reporter: Laszlo Torok



Hi Experts,

 I work for GE Corporate. We have a Backup+Restore+extras project with AWS.
I ran into incompatibility issues when trying to convert Parquet files back to 
CSV.
Our original sources (Greenplum first) cannot process the converted files 
because of improper quoting.



 

We work with several kinds of ERPs and TechDatabases, and the data contains:
 # multiline (CR, CRLF, LF) text fields
 # mixed quoting inside the fields, or just one double quote in a text field
 # text fields where empty string and NULL values can both occur and have 
different meanings



Our last option combination is:
df.write.format("com.databricks.spark.csv") \
    .options(header='false', sep='\013', multiLine='true', escapeQuotes='true',
             quote='"', nullValue='\\N', encoding='UTF-8') \
    .option("quoteAll", 'false') \
    .option("compression", "gzip") \
    .mode('overwrite') \
    .save(s3_csv)
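
For reference, a minimal sketch in plain Python (no Spark needed) of what the two escaped option values above evaluate to before they reach the writer. The variable names are only for illustration.

```python
# Values exactly as written in the options above.
sep = '\013'        # octal escape: code point 11, the vertical tab character
null_value = '\\N'  # escaped backslash followed by N: two characters

# Show the actual characters that are passed on as option values.
print(repr(sep), ord(sep))
print(repr(null_value), len(null_value))
```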

If I do not use escapeQuotes='true', it won't quote fields that contain mixed 
quoting or a single double quote.
If I do use it, the double quotes of an empty string get escaped: 
[sep]\"\"[sep]. Our Greenplum reader cannot read this format for an empty 
string during restoration.
It should be [sep]""[sep] or [sep][sep].
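
To make the expected output concrete, here is a minimal sketch in plain Python of the target format described above. It is not Spark's writer and not a fix for it. The helper names format_field and format_row are hypothetical, and it assumes the Greenplum loader follows the common CSV convention of doubling embedded quotes, which this report does not confirm.

```python
SEP = "\x0b"    # vertical tab, the same character as '\013' in the options
QUOTE = '"'
NULL = "\\N"    # backslash + N, the nullValue from the options


def format_field(value):
    """Render one field: None -> \\N, '' -> "", other values quoted only when needed."""
    if value is None:
        return NULL
    if value == "":
        # A quoted empty string stays distinct from NULL.
        return QUOTE + QUOTE
    needs_quoting = (
        any(ch in value for ch in (SEP, QUOTE, "\r", "\n"))
        or value == NULL  # a literal \N string must not be mistaken for NULL
    )
    if not needs_quoting:
        return value
    # Assumption: embedded quotes are doubled inside a quoted field.
    return QUOTE + value.replace(QUOTE, QUOTE + QUOTE) + QUOTE


def format_row(fields):
    """Join the rendered fields with the separator."""
    return SEP.join(format_field(f) for f in fields)


# Example row: a multiline text field, an empty string, and a NULL.
row = ["290208407\n | INT. RIEL DIN 2X32A 230/400V ", "", None]
print(repr(format_row(row)))
```

With this layout the empty string is written as [sep]""[sep] and NULL as [sep]\N[sep], so the two remain distinguishable on restore.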

Can you help our project find the proper quote and escape combination for data 
that looks like this:

"2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA\" S.A. | 13-APR-20 
"

"290208407
 | INT. RIEL DIN 2X32A 230/400V "



""


Thank you in advance!

 

Regards,

Laszlo Torok



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
