[jira] [Commented] (SPARK-34050) Parquet 2 CSV conversion wrong quoting

Laszlo Torok (Jira) Mon, 11 Jan 2021 00:50:05 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-34050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262485#comment-17262485
 ]


Laszlo Torok commented on SPARK-34050:
--------------------------------------

Hi Hyukjin,

 

Thank you for being so proactive.

As I mentioned, I cannot share even a small piece of the real data, but the 
"Before*" files show how Greenplum produces the backup (GP S3 protocol) in CSV 
format.

After many retries and tests I decided to use these options: 
escapeQuotes='true', quote='"' 

However, escapeQuotes='true' is probably what causes an empty string to be 
written as [separator]\"\"[separator] instead of [separator]""[separator].


When I omit this option, however, the "description" field is not quoted: 
--> 2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA" S.A. | 
13-APR-20  <--- 
The lone double quote in that value caused an "Unterminated quote..." error.

The third issue is multiline fields.
In the source tables many fields contain \n, \r, or \r\n in various 
combinations.
We need to get these records back in the CSV broken across several lines, 
with the line breaks kept inside a double-quoted field.
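
To make the target format concrete, here is a minimal sketch that uses Python's 
standard csv module, not Spark. It only illustrates the output described above 
(an embedded quote doubled instead of backslash-escaped, an empty string written 
as "", and line breaks kept inside a quoted field). The helper names to_csv and 
from_csv are hypothetical, the sample values are taken from the issue 
description, and whether the Greenplum reader accepts exactly this output is an 
assumption that would need checking.

```python
import csv
import io

# Sample values from the issue description: a lone double quote,
# a multiline field, and an empty string.
rows = [[
    "2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA\" S.A. | 13-APR-20 ",
    "290208407\n|INT. RIEL DIN 2X32A 230/400V ",
    "",
]]


def to_csv(rows, sep="\x0b"):
    """Write rows with every field quoted and embedded quotes doubled.

    sep defaults to the vertical tab, which is what '\\013' means in Python.
    """
    buf = io.StringIO(newline="")
    writer = csv.writer(
        buf,
        delimiter=sep,
        quotechar='"',
        doublequote=True,       # " inside a field becomes ""
        quoting=csv.QUOTE_ALL,  # an empty string is written as ""
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text, sep="\x0b"):
    """Read the text back; quoted fields may span several lines."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=sep,
        quotechar='"',
        doublequote=True,
    )
    return [row for row in reader]


text = to_csv(rows)
print(repr(text))
print(from_csv(text) == rows)
```

In this sketch every field is quoted, so an empty string and a multiline value 
both survive a round trip. Distinguishing an empty string from a null would 
still need a separate null marker, which the sketch does not cover.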

Thanks,
Laszlo

> Parquet 2 CSV conversion wrong quoting 
> ---------------------------------------
>
>                 Key: SPARK-34050
>                 URL: https://issues.apache.org/jira/browse/SPARK-34050
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Laszlo Torok
>            Priority: Minor
>         Attachments: 
> After_PRQ2CSVConverion_part-00001-5610129a-bf88-4dda-86f2-878857e9ec54-c000.csv.gz,
>  Before_0c43b1dc7.csv.gz, Before_10e526e33.csv.gz, Before_142d26f3e0.csv.gz, 
> Before_60e291fb0.csv.gz, csv_2_parq, parquet2csv
>
>
> Hi Experts,
>  I work for GE Corporate. We have a Backup+Restore+extras project with AWS.
>  I ran into incompatibility issues when converting Parquet files back to 
> CSV.
>  Our original sources (Greenplum first) cannot process the converted files 
> because of improper quoting.
>  
> We work with several kinds of ERPs and tech databases, and they contain:
>  # multiline (CR, CRLF, LF) text fields
>  # mixed quoting inside fields, or just a single double quote in a text field
>  # text fields where an empty string and a null can both occur and have 
> different meanings
> Our latest option combination is:
>  df.write.format("com.databricks.spark.csv").options(header='false',sep 
> ='\013' ,multiLine ='true',escapeQuotes='true',quote = '"',nullValue ='
> N', encoding='UTF-8').option("quoteAll", 
> 'false').option("compression","gzip").mode('overwrite').save(s3_csv)
> If I do not use escapeQuotes='true', it won't quote fields that contain mixed 
> quotes or a single double quote.
>  If I use it, it escapes the double quotes of an empty string: 
> [sep]\"\"[sep]. ==> Our Greenplum reader cannot read this format for an 
> empty string during restoration.
>  It should be [sep]""[sep] or [sep][sep].
> Can you help our project find the proper quote and escape combination for 
> data that looks like this:
> "2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA\" S.A. | 
> 13-APR-20 "
> "290208407
> |INT. RIEL DIN 2X32A 230/400V "|
> ""
> I found an earlier option that you have removed: quoteMode.Non_Numeric. 
> Thank you in advance!
>  
> Regards,
> Laszlo Torok


