[ 
https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vishnuram selvaraj updated SPARK-26786:
---------------------------------------
    Description: 
Some systems, such as AWS Redshift, write CSV files in which the newline characters ('\r', '\n') are escaped in addition to the quote characters, if they occur in the data.

The Redshift documentation ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]) describes the escaping requirements as follows:

ESCAPE

For CHAR and VARCHAR columns in delimited unload files, an escape character ({{\}}) is placed before every occurrence of the following characters:
 * Linefeed: {{\n}}
 * Carriage return: {{\r}}
 * The delimiter character specified for the unloaded data.
 * The escape character: {{\}}
 * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are specified in the UNLOAD command).
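As an illustration of the escaping rules quoted above, here is a minimal Python sketch (illustrative only, not Redshift's actual implementation) of what such a writer does to a field value:

```python
# Minimal sketch of Redshift-style ESCAPE behaviour (illustrative only,
# not Redshift's actual implementation): a backslash is written before
# every linefeed, carriage return, delimiter, backslash, and quote.
def redshift_escape(value, delimiter=',', quote='"'):
    special = {'\n', '\r', delimiter, '\\', quote}
    out = []
    for ch in value:
        if ch in special:
            out.append('\\')  # prepend the escape character
        out.append(ch)
    return ''.join(out)
```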

 

*Problem statement:*

The Spark CSV reader has no option to strip the escape characters in front of newline characters in the data.

It would really help to add a feature that handles escaped newline characters through another parameter, e.g. (escapeNewline = 'true/false').

*Example:*

Below are the details of my test data set up in a file:
 * The first record in the file has an escaped Windows newline sequence (\r\n).
 * The third record has an escaped Unix newline character (\n).
 * The fourth record has an escaped quote character (").

The file looks like this in the vi editor ({{^M}} is how vi displays a carriage return):

 
{code}
"1","this is \^M\
line1"^M
"2","this is line2"^M
"3","this is \
line3"^M
"4","this is \" line4"^M
"5","this is line5"^M{code}
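For reproducibility, the file shown above can be recreated from Python; the byte layout below is inferred from the vi view, so treat it as an approximation of the original data:

```python
# Approximate reconstruction of the sample file, inferred from the vi view
# above (^M is a carriage return). Records end with CRLF; a backslash
# escapes the CR/LF/quote characters embedded inside the data.
sample = (
    '"1","this is \\\r\\\n line1"\r\n'
    '"2","this is line2"\r\n'
    '"3","this is \\\n line3"\r\n'
    '"4","this is \\" line4"\r\n'
    '"5","this is line5"\r\n'
)

# newline='' prevents Python from translating the CRLF line endings
with open('/tmp/test3.csv', 'w', newline='') as f:
    f.write(sample)
```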
 

When I read the file with Python's csv module using an escape character, it removes the added escape characters, as you can see below:

 
{code:python}
>>> import csv
>>> with open('/tmp/test3.csv', 'r') as readCsv:
...     readFile = csv.reader(readCsv, dialect='excel', escapechar='\\', quotechar='"', delimiter=',', doublequote=False)
...     for row in readFile:
...         print(row)
...
['1', 'this is \r\n line1']
['2', 'this is line2']
['3', 'this is \n line3']
['4', 'this is " line4']
['5', 'this is line5']
{code}
But if I read the same file with the Spark CSV reader, the escape characters in front of the newline characters are not removed, although the escape before the quote (") is removed.
{code:python}
>>> redDf = spark.read.csv(path='file:///tmp/test3.csv', header='false', sep=',', quote='"', escape='\\', multiLine='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true', mode='FAILFAST', inferSchema='false')
>>> redDf.show()
+---+------------------+
|_c0| _c1|
+---+------------------+
|  1|this is \
line1|
| 2| this is line2|
| 3| this is \
line3|
| 4| this is " line4|
| 5| this is line5|
+---+------------------+
{code}
*Expected result:*
{code}
+---+------------------+
|_c0| _c1|
+---+------------------+
| 1|this is 
line1|
| 2| this is line2|
| 3| this is 
line3|
| 4| this is " line4|
| 5| this is line5|
+---+------------------+
{code}
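Until such an option exists, one possible workaround (a sketch, not an official Spark feature) is to post-process the parsed values and strip a backslash that sits directly in front of a carriage return or linefeed:

```python
import re

# The backslash before a quote is already consumed by the reader's
# escape='\\' option, so only the \<CR> and \<LF> sequences remain.
_ESCAPED_NEWLINE = re.compile(r'\\(?=[\r\n])')

def unescape_newlines(value):
    """Drop the escape character in front of embedded newline characters."""
    return _ESCAPED_NEWLINE.sub('', value) if value is not None else None
```

In PySpark the same pattern could presumably be applied per column, e.g. with {{F.regexp_replace(col, r'\\(?=[\r\n])', '')}}, since Java regular expressions support the lookahead used here.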

> Handle to treat escaped newline characters('\r','\n') in spark csv
> ------------------------------------------------------------------
>
>                 Key: SPARK-26786
>                 URL: https://issues.apache.org/jira/browse/SPARK-26786
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: vishnuram selvaraj
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
