[
https://issues.apache.org/jira/browse/CSV-294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joern Huxhorn updated CSV-294:
------------------------------
Description:
Reading data that contains " does not work if the escape character is manually
set to {{'"'}} as specified in [RFC
4180|https://datatracker.ietf.org/doc/html/rfc4180]. It works for other escape
characters or if no escape character is defined in the format.
{{CSVFormat.DEFAULT}} or at least {{CSVFormat.RFC4180}} and {{CSVFormat.EXCEL}}
should have the escape character set to {{'"'}} instead of {{null}} by default.
This line in {{Lexer.java}} is responsible for the originally quite erroneous
ticket:
{{this.escape = mapNullToDisabled(format.getEscapeCharacter());}}
From this line I (wrongly) deduced that an unspecified escape character would
actually disable escaping. Because of that I wanted to enable it by setting it
to {{'"'}}, which causes exceptions in the Lexer for perfectly valid input.
That in turn convinced me that this is a way bigger issue than it is. Sorry
about that.
I don't think that the current situation is ideal, though. I would not have
been this confused if {{CSVFormat}} were more explicit about the escape
char that will be used, i.e. if {{toString()}} showed the implicitly used
quote character. It is currently omitted from the output if it is not set
explicitly.
h4. Relevant part of the RFC:
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
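The rule quoted above can be illustrated with a small, library-independent sketch. This is not how the Commons CSV {{Lexer}} is implemented; the names {{Rfc4180LineSketch}} and {{parseLine}} are made up for illustration, and the code assumes a single line without embedded line breaks.

```java
import java.util.ArrayList;
import java.util.List;

final class Rfc4180LineSketch {

    /** Splits one CSV line into fields; "" inside a quoted field yields one literal ". */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: keep one, skip the other
                        i++;
                    } else {
                        inQuotes = false;    // single quote: end of the quoted section
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // start of a quoted section
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // The RFC example line: "aaa","b""bb","ccc"
        System.out.println(parseLine("\"aaa\",\"b\"\"bb\",\"ccc\""));
    }
}
```

In this sketch the quote character doubles as its own escape, so no separate escape character is configured at all, which matches the observation that the formats work when the escape character is left unset.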
h4. Related issue:
https://issues.apache.org/jira/browse/CSV-150
was:
Writing and reading data that contains " does not work even if escape character
is set to '"' as specified in [RFC
4180|https://datatracker.ietf.org/doc/html/rfc4180]. It works for other escape
characters.
It *does not work* if no escape character is specified at all, which was
reported in CSV-150.
This means that the default {{CSVFormat}} constants are unable to handle data
that contain " somewhere in the middle of the string.
{{CSVFormat.DEFAULT}} or at least {{CSVFormat.RFC4180}} and {{CSVFormat.EXCEL}}
should have escape character set to '"' by default, as defined in the RFC.
This is also the way Excel escapes ", i.e. Excel is behaving as specified in
RFC 4180 but commons-csv isn't.
I upgraded this ticket to *Critical* since the current default behavior will
cause broken CSV files that can't be consumed with commons-csv and changing the
default to what the RFC defines has a similar effect.
h4. Relevant part of the RFC:
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
h4. Related issue:
https://issues.apache.org/jira/browse/CSV-150
> CSVFormat does not support " as escape char
> -------------------------------------------
>
> Key: CSV-294
> URL: https://issues.apache.org/jira/browse/CSV-294
> Project: Commons CSV
> Issue Type: Bug
> Affects Versions: 1.9.0
> Reporter: Joern Huxhorn
> Priority: Critical
>
> Reading data that contains " does not work if the escape character is
> manually set to {{'"'}} as specified in [RFC
> 4180|https://datatracker.ietf.org/doc/html/rfc4180]. It works for other
> escape characters or if no escape character is defined in the format.
> {{CSVFormat.DEFAULT}} or at least {{CSVFormat.RFC4180}} and
> {{CSVFormat.EXCEL}} should have the escape character set to {{'"'}} instead
> of {{null}} by default.
> This line in {{Lexer.java}} is responsible for the originally quite erroneous
> ticket:
> {{this.escape = mapNullToDisabled(format.getEscapeCharacter());}}
> From this line I (wrongly) deduced that an unspecified escape character would
> actually disable escaping. Because of that I wanted to enable it by setting
> it to {{'"'}}, which causes exceptions in the Lexer for perfectly valid input.
> That in turn convinced me that this is a way bigger issue than it is. Sorry
> about that.
> I don't think that the current situation is ideal, though. I would not have
> been this confused if {{CSVFormat}} were more explicit about the escape
> char that will be used, i.e. if {{toString()}} showed the implicitly used
> quote character. It is currently omitted from the output if it is not set
> explicitly.
> h4. Relevant part of the RFC:
> 7. If double-quotes are used to enclose fields, then a double-quote
> appearing inside a field must be escaped by preceding it with
> another double quote. For example:
> "aaa","b""bb","ccc"
> h4. Related issue:
> https://issues.apache.org/jira/browse/CSV-150
--
This message was sent by Atlassian Jira
(v8.20.1#820001)