On 2017/05/25 09:22, Markus Scherer wrote:
On Wed, May 24, 2017 at 3:56 PM, Karl Williamson <pub...@khwilliamson.com>
wrote:

On 05/24/2017 12:46 AM, Martin J. Dürst wrote:

That's wrong. There was a public review issue with various options and
with feedback, and the recommendation has been implemented and in wide
use (among other places, in major programming languages and browsers)
without problems for quite some time.


Could you supply a reference to the PRI and its feedback?


http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

It is correct that it didn't give any of the *examples* that are under discussion now. On the other hand, the PRI is very clear about what it means by "maximal subpart":

Citing directly from the PRI:

>>>>
The term "maximal subpart of the ill-formed subsequence" refers to the longest potentially valid initial subsequence or, if none, then to the next single code unit.
>>>>
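
Just to make that definition concrete with an example of my own (not one from the PRI): as far as I know, CPython 3.3 and later follow the recommendation, so its 'replace' error handler can be used to see both branches of the definition:

# Illustration only; assumes a decoder that follows the recommendation
# (CPython 3.3+ is believed to be one such implementation).

# "Longest potentially valid initial subsequence": <F1 80 80> could
# start a well-formed four-byte sequence, so it is one maximal subpart
# and becomes a single U+FFFD.
assert b"\x61\xf1\x80\x80\x62".decode("utf-8", "replace") == "a\ufffdb"

# "If none, then the next single code unit": a lone continuation byte
# cannot start anything well-formed, so it is replaced on its own.
assert b"\x61\x80\x62".decode("utf-8", "replace") == "a\ufffdb"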

At the time of the PRI, so-called "overlongs" were already ill-formed.

That change goes back to 2003 or earlier (RFC 3629 (https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect the tightening of the UTF-8 definition in Unicode/ISO 10646).

The recommendation in TUS 5.2 is "Replace each maximal subpart of an
ill-formed subsequence by a single U+FFFD."


You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

No, the text, both in the PRI and in Unicode 5.2, is quite clear. The "does not fit" (which I haven't found in either text) is clearly grounded in "ill-formed UTF-8". And there's no question about what "ill-formed UTF-8" means, in particular in Unicode 5.2, where you just have to go two pages back to find byte sequences such as <C0 AF> and <E0 9F 80> called out explicitly as ill-formed.

Claims, such as the one in the L2/17-168 document, that there is an "option 2a" are just not substantiated. It's true that there are no explicit examples in the PRI that would allow one to distinguish between converting e.g.
FC BF BF BF BF 80
to a single FFFD or to six of these. But there's no need to have examples for every corner case if the text is clear enough. In the above six-byte sequence, there's not a single potentially valid (initial) subsequence, so it's all single code units.
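
For illustration (these are my own examples, and I'm assuming an implementation such as CPython 3.3+, which as far as I know follows the recommendation), this is what the definition yields for the sequences mentioned above:

# Illustration only; assumes CPython 3.3+, believed to follow the
# "maximal subpart" recommendation.
U = "\ufffd"

# Overlongs contain no potentially valid initial subsequence under the
# current (post-2003) definition, so each code unit is replaced on its own.
assert b"\xc0\xaf".decode("utf-8", "replace") == U * 2
assert b"\xe0\x9f\x80".decode("utf-8", "replace") == U * 3

# The six-byte sequence above: FC is not a valid lead byte and the rest
# are lone continuation bytes, so the result is six U+FFFDs, not one.
assert b"\xfc\xbf\xbf\xbf\xbf\x80".decode("utf-8", "replace") == U * 6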


And I agree with that.  And I view an overlong sequence as a maximal
ill-formed subsequence

Can you point to any definition that would include or allow such an interpretation? I just haven't found any yet, either in the PRI or in Unicode 5.2.

that should be replaced by a single FFFD. There's
nothing in the text of 5.2 that immediately follows that recommendation
that indicates to me that my view is incorrect.

I have to agree that the text in Unicode 5.2 could be clearer. It's a hodgepodge of attempts at justifications and definitions. And the word "maximal" itself may also contribute to pushing the interpretation in one direction.

But there's plenty in the text that makes it absolutely clear that some things cannot be included. In particular, it says

>>>>
The term “maximal subpart of an ill-formed subsequence” refers to the code units that were collected in this manner. They could be the start of a well-formed sequence, except that the sequence lacks the proper continuation. Alternatively, the converter may have found a continuation code unit, which cannot be the start of a well-formed sequence.
>>>>

And the "in this manner" refers to:
>>>>
A sequence of code units will be processed up to the point where the sequence either can be unambiguously interpreted as a particular Unicode code point or where the converter recognizes that the code units collected so far constitute an ill-formed subsequence.
>>>>

So we have the same thing twice: Bail out as soon as something is ill-formed.
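
For what it's worth, here is a minimal sketch (mine, purely to make the procedure concrete, and not a reference implementation) of that collection loop, using the well-formed byte ranges of Table 3-7: collect code units only as long as they could still lead to a well-formed sequence, and emit a single U+FFFD as soon as that is no longer possible.

# A minimal sketch of the "maximal subpart" procedure, based on the
# well-formed byte ranges of Table 3-7. Illustration only.

REPLACEMENT = "\ufffd"

def second_byte_range(lead):
    # Allowed range for the byte immediately following each lead byte
    # (all later continuation bytes are 0x80..0xBF).
    if 0xC2 <= lead <= 0xDF or 0xE1 <= lead <= 0xEC or \
       0xEE <= lead <= 0xEF or 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)
    if lead == 0xED:
        return (0x80, 0x9F)
    if lead == 0xF0:
        return (0x90, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)
    return None          # C0, C1, F5..FF, or a lone continuation byte

def decode_with_maximal_subparts(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:                  # ASCII byte: well-formed
            out.append(chr(lead))
            i += 1
            continue
        allowed = second_byte_range(lead)
        if allowed is None:               # cannot start anything well-formed:
            out.append(REPLACEMENT)       # replace the next single code unit
            i += 1
            continue
        length = 2 if lead <= 0xDF else (3 if lead <= 0xEF else 4)
        # Collect code units while they could still be the start of a
        # well-formed sequence of the expected length.
        j = i + 1
        while j < i + length and j < len(data):
            lo, hi = allowed if j == i + 1 else (0x80, 0xBF)
            if not lo <= data[j] <= hi:
                break
            j += 1
        if j - i == length:               # unambiguously one code point
            out.append(data[i:j].decode("utf-8"))
        else:                             # maximal subpart data[i:j]
            out.append(REPLACEMENT)
        i = j
    return "".join(out)

# The six-byte sequence discussed above: six U+FFFDs, not one.
assert decode_with_maximal_subparts(b"\xfc\xbf\xbf\xbf\xbf\x80") == REPLACEMENT * 6
# An overlong <C0 AF>: two U+FFFDs.
assert decode_with_maximal_subparts(b"\xc0\xaf") == REPLACEMENT * 2
# A truncated but potentially valid start <F1 80 80>: a single U+FFFD.
assert decode_with_maximal_subparts(b"a\xf1\x80\x80b") == "a" + REPLACEMENT + "b"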


Perhaps my view is colored by the fact that I now maintain code that was
written to parse UTF-8 back when overlongs were still considered legal
input.

Thanks for providing this information. That's a lot more useful than "feels right", which was given as a reason on this list before.


An overlong was a single unit.  When they became illegal, the code
still considered them a single unit.

That's fine for your code. I might do the same (or not) if I were you, because one indeed never knows in which situation some code is used, and what repercussions a change might produce.

But the PRI, and the wording in Unicode 5.2, were created when overlongs, 5-byte and 6-byte sequences, surrogate pairs, and so on were already very clearly ill-formed. If these texts had intended to make an exception for any of these cases, that exception would clearly have had to be written into the actual text.

Saying something like "the text isn't clear because it says ill-formed, but maybe it means ill-formed not as defined at the time it was written, but as defined quite a few years before" just doesn't make sense to me at all.


I can understand how someone who comes along later could say C0 can't be
followed by any continuation character that doesn't yield an overlong,
therefore C0 is a maximal subsequence.

Yes indeed, because maximal subsequences are defined by reference to well-formed/ill-formed subsequences, and what's ill-formed is defined in the same standard at the same time.

There's nobody "coming along later". That kind of wording would be appropriate if the PRI and the recommendation in the standard had been made, say, in the 1990s, before the tightening of the UTF-8 definition. Then somebody could say that Unicode overlooked that, by changing the definition of well-formed UTF-8, it implicitly changed the recommendation for how to generate U+FFFDs.

But no such thing at all happened. The PRI was evaluated, and the recommendation included in the text of Unicode, in the context of the then-existing (and since then unchanged) definition of UTF-8.


But I assert that my interpretation is just as valid as that one.

Sorry, but it cannot be valid, because of the timing. The tightening of the UTF-8 definition happened years before the PRI.


And perhaps more so, because of historical precedent.

It's good to know that there are older implementations that behave differently. And I understand that in some cases, these might be reluctant to change. My comments, and Henri's, are very much motivated by the fact that we are reluctant to change our implementations.

It may be worth thinking about whether the Unicode standard should mention implementations like yours. But there should be no doubt that the PRI and Unicode 5.2 (and the current version of Unicode) are clear about what they recommend, and that this recommendation is based on the definition of UTF-8 in force at the time (and unchanged since), not on a historical definition of UTF-8.

Regards,   Martin.
