Re: Should CSV parsing be stricter about mid-field quotes?

Andrew Dunstan Sat, 13 May 2023 05:45:09 -0700


On 2023-05-13 Sa 04:20, Joel Jacobson wrote:

On Fri, May 12, 2023, at 21:57, Andrew Dunstan wrote:
Maybe this is unexpected by you, but it's not by me. What other sane interpretation of that data could there be? And what CSV producer outputs such horrible content? As you've noted, ours certainly does not. Our rules are clear: quotes within quotes must be escaped (default escape is by doubling the quote char). Allowing partial fields to be quoted was a deliberate decision when CSV parsing was implemented, because examples have been seen in the wild.
So I don't think our behaviour is broken or needs fixing. As mentioned by Greg, this is an example of the adage about being liberal in what you accept.
I understand your position, and your points are indeed in line with the
traditional "Robustness Principle" (aka "Postel's Law") [1] from 1980,
which suggests "be conservative in what you send, be liberal in what you
accept."
However, I'd like to offer a different perspective that might be worth
considering.
A 2021 IETF draft, "The Harmful Consequences of the Robustness
Principle" [2], argues that the flexibility advocated by Postel's Law
can lead to problems such as unclear specifications and a multitude of
varying implementations. Features that initially seem helpful can
unexpectedly turn into bugs, resulting in unanticipated consequences and
data integrity risks.

Based on the feedback from you and others, I'd like to revise my earlier
proposal. Rather than adding an option to preserve the existing
behavior, I now think it's better to simply report an error in such
cases. This approach offers several benefits: it simplifies the CSV
parser, reduces the risk of misinterpreting data due to malformed input,
and prevents the all-too-familiar situation where users blindly apply an
error hint without understanding the consequences.
Finally, I acknowledge that we can't foresee the number of CSV producers
that produce mid-field quoting, and this change may cause compatibility
issues for some users. However, I consider this an acceptable tradeoff.
Users encountering the error would receive a clear message explaining
that mid-field quoting is not allowed and that they should change their
CSV producer's settings to escape quotes by doubling the quote
character. Importantly, this change guarantees that previously parsed
data won't be misinterpreted, as it only enforces stricter parsing
rules.

[1] https://datatracker.ietf.org/doc/html/rfc761#section-2.10
[2] https://www.ietf.org/archive/id/draft-iab-protocol-maintenance-05.html

I'm pretty reluctant to change something that's been working as designed for almost 20 years, and about which we have hitherto had zero complaints that I recall.

I could see an argument for a STRICT mode which would disallow partially quoted fields, although I'd like some evidence that we're dealing with a real problem here. Is there really a CSV producer that produces output like that you showed in your example? And if so has anyone objected to them about the insanity of that?
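
To make the distinction concrete, here is a minimal Python sketch of the two behaviours under discussion. It is not PostgreSQL's parser: split_csv_line, its strict flag, and MidFieldQuoteError are hypothetical names invented for illustration, and the sample input is made up rather than taken from the earlier example. The sketch assumes a single-line record and the default case where the escape character is the quote character itself.

```python
class MidFieldQuoteError(ValueError):
    """Raised in strict mode when a quote opens or closes mid-field."""


def split_csv_line(line, delim=",", quote='"', strict=False):
    """Split one CSV record into fields.

    Lenient mode (strict=False): a quote may open or close anywhere in a
    field, so only part of a field can be quoted.
    Strict mode (strict=True): a quote may open only at the start of a
    field and must close at the end of it.
    In both modes a doubled quote inside a quoted section is a literal quote.
    """
    fields = []
    buf = []            # characters of the field being built
    in_quote = False
    i, n = 0, len(line)
    while i < n:
        c = line[i]
        if in_quote:
            if c == quote:
                if i + 1 < n and line[i + 1] == quote:
                    buf.append(quote)   # doubled quote -> literal quote
                    i += 2
                    continue
                in_quote = False        # closing quote
                if strict and i + 1 < n and line[i + 1] != delim:
                    raise MidFieldQuoteError(
                        f"quote closed mid-field at position {i}")
            else:
                buf.append(c)
        elif c == quote:
            if strict and buf:
                raise MidFieldQuoteError(
                    f"quote opened mid-field at position {i}")
            in_quote = True
        elif c == delim:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    if in_quote:
        raise ValueError("unterminated quoted field")
    fields.append("".join(buf))
    return fields


# Lenient: the partially quoted middle field swallows the delimiter.
print(split_csv_line('a,b"c,d"e,f'))                    # ['a', 'bc,de', 'f']
# Properly quoted input is accepted identically in both modes.
print(split_csv_line('a,"b ""c"" d",e', strict=True))   # ['a', 'b "c" d', 'e']
```

With strict=True the first input raises MidFieldQuoteError instead of silently yielding three fields, which is the trade-off being weighed here: an error on input the lenient reading has accepted so far, in exchange for never guessing at what a stray quote meant.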



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
