On 2023-05-13 Sa 04:20, Joel Jacobson wrote:
On Fri, May 12, 2023, at 21:57, Andrew Dunstan wrote:
Maybe this is unexpected by you, but it's not by me. What other sane
interpretation of that data could there be? And what CSV producer
outputs such horrible content? As you've noted, ours certainly does
not. Our rules are clear: quotes within quotes must be escaped
(default escape is by doubling the quote char). Allowing partial
fields to be quoted was a deliberate decision when CSV parsing was
implemented, because examples have been seen in the wild.
So I don't think our behaviour is broken or needs fixing. As
mentioned by Greg, this is an example of the adage about being
liberal in what you accept.
I understand your position, and your points are indeed in line with the
traditional "Robustness Principle" (aka "Postel's Law") [1] from 1980,
which suggests "be conservative in what you send, be liberal in what you
accept."
However, I'd like to offer a different perspective that might be worth
considering.
A 2021 IETF draft, "The Harmful Consequences of the Robustness
Principle" [2], argues that the flexibility advocated by Postel's Law
can lead to problems such as unclear specifications and a multitude of
varying implementations. Features that initially seem helpful can
unexpectedly turn into bugs, resulting in unanticipated consequences and
data integrity risks.
Based on the feedback from you and others, I'd like to revise my earlier
proposal. Rather than adding an option to preserve the existing
behavior, I now think it's better to simply report an error in such
cases. This approach offers several benefits: it simplifies the CSV
parser, reduces the risk of misinterpreting data due to malformed input,
and prevents the all-too-familiar situation where users blindly apply an
error hint without understanding the consequences.
Finally, I acknowledge that we can't foresee the number of CSV producers
that produce mid-field quoting, and this change may cause compatibility
issues for some users. However, I consider this an acceptable tradeoff.
Users encountering the error would receive a clear message explaining
that mid-field quoting is not allowed and that they should change their
CSV producer's settings to escape quotes by doubling the quote
character. Importantly, this change guarantees that previously parsed
data won't be misinterpreted, as it only enforces stricter parsing
rules.
[1] https://datatracker.ietf.org/doc/html/rfc761#section-2.10
[2] https://www.ietf.org/archive/id/draft-iab-protocol-maintenance-05.html
I'm pretty reluctant to change something that's been working as designed
for almost 20 years, and about which we have hitherto had zero
complaints that I recall.
I could see an argument for a STRICT mode which would disallow partially
quoted fields, although I'd like some evidence that we're dealing with a
real problem here. Is there really a CSV producer that produces output
like the one you showed in your example? And if so, has anyone objected
to them about the insanity of that?
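To make the two behaviours under discussion concrete, here is a minimal
Python sketch of a single-line field splitter. It is only a simplified
model of the rules described in this thread (a quote may open anywhere in
a field, and a quote inside quotes is escaped by doubling); it is not the
actual COPY code. The function name split_csv_line, the strict flag and
the MidFieldQuoteError class are hypothetical names invented for the
illustration.

```python
class MidFieldQuoteError(ValueError):
    """Hypothetical error for a quote that starts or ends mid-field."""


def split_csv_line(line, strict=False, delim=",", quote='"'):
    """Split one CSV line into fields.

    Lenient mode (default): a quote character toggles quoting anywhere in
    a field, so partially quoted fields are accepted.
    Strict mode (an assumption, modelled on the STRICT idea above): a
    field is either quoted in full or not quoted at all.
    """
    fields = []
    buf = []
    in_quote = False
    at_field_start = True   # nothing consumed yet for the current field
    just_closed = False     # a quoted section has ended in this field
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quote:
            if ch == quote:
                if i + 1 < n and line[i + 1] == quote:
                    # Doubled quote inside quotes: one literal quote.
                    buf.append(quote)
                    i += 2
                    continue
                in_quote = False
                just_closed = True
            else:
                buf.append(ch)
        elif ch == delim:
            fields.append("".join(buf))
            buf = []
            at_field_start = True
            just_closed = False
            i += 1
            continue
        elif ch == quote:
            if strict and not at_field_start:
                raise MidFieldQuoteError(
                    f"quote at position {i} does not start the field")
            in_quote = True
        else:
            if strict and just_closed:
                raise MidFieldQuoteError(
                    f"data at position {i} follows a closing quote")
            buf.append(ch)
        at_field_start = False
        i += 1
    if in_quote:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields


if __name__ == "__main__":
    # Lenient: the quoted part swallows the delimiter.
    print(split_csv_line('a"b,c"d,e'))               # ['ab,cd', 'e']
    # Strict: a properly doubled quote is still fine.
    print(split_csv_line('"a""b",c', strict=True))   # ['a"b', 'c']
```

In lenient mode the partially quoted input is silently read as two
fields, while strict=True rejects the same line with an error and still
accepts fully quoted fields with doubled quotes.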
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com