[
https://issues.apache.org/jira/browse/SPARK-20155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Moritz updated SPARK-20155:
--------------------------------
Description:
According to:
https://tools.ietf.org/html/rfc4180#section-2
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
This currently works as-is, but the following does not:
"aaa","b""b,b","ccc"
while "aaa","b\"b,b","ccc" does get parsed.
I assume this happens because quotes are currently parsed in pairs, and
that somehow ends up unquoting the delimiter.
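To illustrate the rule the RFC requires, here is a minimal field splitter in plain Python (a sketch for illustration only, not Spark's actual parser): the key is that a quote inside a quoted field needs one character of lookahead, so that "" becomes a literal quote while a lone quote closes the field, even when the next character is the delimiter.

```python
def parse_line(line, delim=",", quote='"'):
    """Minimal RFC 4180 field splitter (illustrative sketch)."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == quote:
                # Doubled quote inside a quoted field -> literal quote;
                # lone quote -> close the field (even if delim follows).
                if i + 1 < len(line) and line[i + 1] == quote:
                    buf.append(quote)
                    i += 1
                else:
                    in_quotes = False
            else:
                buf.append(c)
        else:
            if c == quote:
                in_quotes = True
            elif c == delim:
                fields.append("".join(buf))
                buf = []
            else:
                buf.append(c)
        i += 1
    fields.append("".join(buf))
    return fields

print(parse_line('"aaa","b""b,b","ccc"'))  # ['aaa', 'b"b,b', 'ccc']
```

With this lookahead, the delimiter after the doubled quote stays inside the field, which is exactly the case the report says Spark gets wrong.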
Edit: So future readers don't have to dive into the comments: a workaround (as
of Spark 2.0) is to explicitly declare the escape character to be a double
quote: (read.csv.option("escape","\""))
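For comparison, Python's standard-library csv module applies RFC 4180's doubled-quote convention by default, so the problematic line parses without declaring any escape character. A quick sketch (stdlib only, not Spark code) showing the expected result:

```python
import csv
import io

# The line from the report that Spark fails on: a doubled quote
# immediately followed by the delimiter, inside a quoted field.
line = '"aaa","b""b,b","ccc"'

# csv.reader defaults to doublequote=True, i.e. RFC 4180's "" escaping.
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['aaa', 'b"b,b', 'ccc']
```

This is the parse the reporter expects Spark to produce; the escape-character workaround above coaxes Spark's parser into the same behavior.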
was:
According to:
https://tools.ietf.org/html/rfc4180#section-2
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
This currently works as-is, but the following does not:
"aaa","b""b,b","ccc"
while "aaa","b\"b,b","ccc" does get parsed.
I assume this happens because quotes are currently parsed in pairs, and
that somehow ends up unquoting the delimiter.
> CSV-files with quoted quotes can't be parsed, if delimiter follows quoted
> quote
> -------------------------------------------------------------------------------
>
> Key: SPARK-20155
> URL: https://issues.apache.org/jira/browse/SPARK-20155
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.0.0
> Reporter: Rick Moritz
>
> According to:
> https://tools.ietf.org/html/rfc4180#section-2
> 7. If double-quotes are used to enclose fields, then a double-quote
> appearing inside a field must be escaped by preceding it with
> another double quote. For example:
> "aaa","b""bb","ccc"
> This currently works as-is, but the following does not:
> "aaa","b""b,b","ccc"
> while "aaa","b\"b,b","ccc" does get parsed.
> I assume this happens because quotes are currently parsed in pairs,
> and that somehow ends up unquoting the delimiter.
> Edit: So future readers don't have to dive into the comments: a workaround
> (as of Spark 2.0) is to explicitly declare the escape character to be a
> double quote: (read.csv.option("escape","\""))
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]