Hi,

On 2025-02-15 12:35:45 -0800, Jeff Davis wrote:
> I am not suggesting a change, but there's a minor point about the
> behavior of the replacement that I'd like to highlight:
>
> Unicode discusses a choice[1]: "An ill-formed subsequence consisting of
> more than one code unit could be treated as a single error or as
> multiple errors."
>
> The patch implements the latter. Escaping:
>   <7A F0 80 80 41 7A>
> results in:
>   <7A C0 20 C0 20 C0 20 41 7A>
>
> The Unicode standard suggests[2] that the former approach may provide
> more consistency in how it's done, but that doesn't seem important or
> relevant for our purposes. I'd favor whichever approach results in
> simpler code.
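
To make the "multiple errors" behavior in the example above concrete, here is a minimal sketch in C. It is not the patch's code: the function names are hypothetical, it only handles UTF-8, and the validity check is deliberately simplified (lead byte plus continuation bytes, with no overlong or surrogate checks). The only thing taken from the example is the two-byte replacement C0 20.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: length of the UTF-8 sequence announced by a lead
 * byte, or 0 if the byte cannot start a sequence.
 */
static size_t
utf8_expected_len(uint8_t b)
{
    if (b < 0x80)
        return 1;
    if (b >= 0xC2 && b <= 0xDF)
        return 2;
    if (b >= 0xE0 && b <= 0xEF)
        return 3;
    if (b >= 0xF0 && b <= 0xF4)
        return 4;
    return 0;
}

/*
 * Simplified check (an assumption of this sketch): the sequence is
 * accepted if it fits in the remaining input and every trailing byte
 * has the form 10xxxxxx.
 */
static int
utf8_seq_ok(const uint8_t *s, size_t remaining)
{
    size_t len = utf8_expected_len(s[0]);

    if (len == 0 || len > remaining)
        return 0;
    for (size_t i = 1; i < len; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    }
    return 1;
}

/*
 * "Multiple errors" replacement: when the sequence starting at the
 * current byte is not valid, emit C0 20 and advance by exactly one
 * byte. dst must have room for 2 * srclen bytes. Returns the number of
 * bytes written.
 */
size_t
replace_invalid_per_byte(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    size_t in = 0;
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    while (in < srclen)
    {
        if (utf8_seq_ok(src + in, srclen - in))
        {
            size_t len = utf8_expected_len(src[in]);

            for (size_t i = 0; i < len; i++)
                dst[out++] = src[in++];
        }
        else
        {
            dst[out++] = 0xC0;
            dst[out++] = 0x20;
            in++;               /* skip only the offending byte */
        }
    }
    return out;
}
```

Fed <7A F0 80 80 41 7A>, this writes the nine bytes <7A C0 20 C0 20 C0 20 41 7A>: F0 fails because 41 is not a continuation byte, and then each 80 fails on its own as a stray continuation byte.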

It seems completely infeasible to me to implement the "single error"
approach in a minor version. It'd afaict require non-trivial new
infrastructure. We can't just consume up to the next byte without a high
bit, because some encodings have subsequent bytes that are not guaranteed
to have a high bit set.

Greetings,

Andres Freund