Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Markus Scherer via Unicode Wed, 24 May 2017 17:28:51 -0700

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson <pub...@khwilliamson.com>
wrote:


> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among others, in major programming languages and browsers) without
>> problems for quite some time.
>>
>
> Could you supply a reference to the PRI and its feedback?
>

http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>

You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

> And I agree with that.  And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input.  An overlong was a single unit.  When they became illegal, the code
> still considered them a single unit.
>

Right.

> I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>

Right.
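
For readers following along, here is a minimal Python sketch of the two
readings discussed in this thread. It is an illustration under stated
assumptions, not code from any of the implementations mentioned. The
function name decode_single_unit is hypothetical. It treats a lead byte
plus the continuation bytes its bit pattern calls for as one unit, and
replaces an ill-formed unit with a single U+FFFD. As a simplification it
handles only lead bytes C0..F7 and treats F8..FF as single-byte errors.
The other reading (C0 by itself is a maximal subpart) is shown using
Python's built-in decoder with errors="replace", which, as far as I know,
behaves that way in CPython 3.

```python
def decode_single_unit(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per structurally delimited unit.

    Hypothetical helper: a unit is a lead byte plus up to the number of
    continuation bytes (80..BF) implied by the lead byte's bit pattern.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII passes through unchanged
            out.append(chr(b))
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need = 1                      # two-byte form, including overlong C0/C1
        elif 0xE0 <= b <= 0xEF:
            need = 2                      # three-byte form
        elif 0xF0 <= b <= 0xF7:
            need = 3                      # four-byte form
        else:
            out.append("\ufffd")          # lone continuation byte, or F8..FF
            i += 1
            continue
        # Consume at most `need` continuation bytes after the lead byte.
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        unit = data[i:j]
        try:
            out.append(unit.decode("utf-8"))   # complete and well-formed
        except UnicodeDecodeError:
            out.append("\ufffd")               # truncated, overlong, surrogate, ...
        i = j
    return "".join(out)


if __name__ == "__main__":
    for sample in (b"\xc0\x80", b"\xe0\x80\x80"):
        single = decode_single_unit(sample)
        builtin = sample.decode("utf-8", errors="replace")
        print(sample.hex(" "), len(single), len(builtin))
```

For the overlong input C0 80, the single-unit reading produces one U+FFFD,
while the built-in decoder produces two. For the example sequence from
TUS 5.2 chapter 3 (61 F1 80 80 E1 80 C2 62 80 63 80 BF 64), both produce
the same output, which matches the observation above that the published
examples do not distinguish the two readings.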

> But I assert that my interpretation is just as valid as that one.  And
> perhaps more so, because of historical precedent.
>

I agree.

markus
