On 2017/05/25 09:22, Markus Scherer wrote:
On Wed, May 24, 2017 at 3:56 PM, Karl Williamson <pub...@khwilliamson.com>
wrote:

On 05/24/2017 12:46 AM, Martin J. Dürst wrote:

That's wrong. There was a public review issue with various options and
with feedback, and the recommendation has been implemented and in wide
use (among other places, in major programming languages and browsers)
without problems for quite some time.


Could you supply a reference to the PRI and its feedback?


http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

It is correct that it didn't give any of the *examples* that are under discussion now. On the other hand, the PRI is very clear about what it means by "maximal subpart":

Citing directly from the PRI:

>>>>
The term "maximal subpart of the ill-formed subsequence" refers to the longest potentially valid initial subsequence or, if none, then to the next single code unit.
>>>>
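
Just to make that definition concrete with an example of my own (not one from the PRI): as far as I know, CPython 3.3 and later follow the recommendation, so its 'replace' error handler can be used to see both branches of the definition:

# Illustration only; assumes a decoder that follows the recommendation
# (CPython 3.3+ is believed to be one such implementation).

# "Longest potentially valid initial subsequence": <F1 80 80> could
# start a well-formed four-byte sequence, so it is one maximal subpart
# and becomes a single U+FFFD.
assert b"\x61\xf1\x80\x80\x62".decode("utf-8", "replace") == "a\ufffdb"

# "If none, then the next single code unit": a lone continuation byte
# cannot start anything well-formed, so it is replaced on its own.
assert b"\x61\x80\x62".decode("utf-8", "replace") == "a\ufffdb"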

At the time of the PRI, so-called "overlongs" were already ill-formed.

That change goes back to 2003 or earlier (RFC 3629 (https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect the tightening of the UTF-8 definition in Unicode/ISO 10646).

The recommendation in TUS 5.2 is "Replace each maximal subpart of an
ill-formed subsequence by a single U+FFFD."


You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

No, the text, both in the PRI and in Unicode 5.2, is quite clear. The "does not fit" (which I haven't found in either text) is clearly grounded in "ill-formed UTF-8". And there's no question about what "ill-formed UTF-8" means, in particular in Unicode 5.2, where you just have to go two pages back to find byte sequences such as <C0 AF> and <E0 9F 80> called out explicitly as ill-formed.

Claims, such as the one in the L2/17-168 document, that there is an "option 2a" are just not substantiated. It's true that there are no explicit examples in the PRI that would allow one to distinguish between converting e.g.
FC BF BF BF BF 80
to a single FFFD or to six of these. But there's no need to have examples for every corner case if the text is clear enough. In the above six-byte sequence, there's not a single potentially valid (initial) subsequence, so it's all single code units.
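
For illustration (these are my own examples, and I'm assuming an implementation such as CPython 3.3+, which as far as I know follows the recommendation), this is what the definition yields for the sequences mentioned above:

# Illustration only; assumes CPython 3.3+, believed to follow the
# "maximal subpart" recommendation.
U = "\ufffd"

# Overlongs contain no potentially valid initial subsequence under the
# current (post-2003) definition, so each code unit is replaced on its own.
assert b"\xc0\xaf".decode("utf-8", "replace") == U * 2
assert b"\xe0\x9f\x80".decode("utf-8", "replace") == U * 3

# The six-byte sequence above: FC is not a valid lead byte and the rest
# are lone continuation bytes, so the result is six U+FFFDs, not one.
assert b"\xfc\xbf\xbf\xbf\xbf\x80".decode("utf-8", "replace") == U * 6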


And I agree with that.  And I view an overlong sequence as a maximal
ill-formed subsequence

Can you point to any definition that would include or allow such an interpretation? I just haven't found any yet, either in the PRI or in Unicode 5.2.

that should be replaced by a single FFFD. There's
nothing in the text of 5.2 that immediately follows that recommendation
that indicates to me that my view is incorrect.

I have to agree that the text in Unicode 5.2 could be clearer. It's a hodgepodge of attempts at justifications and definitions. And the word "maximal" itself may also contribute to pushing the interpretation in one direction.

But there's plenty in the text that makes it absolutely clear that some things cannot be included. In particular, it says

>>>>
The term “maximal subpart of an ill-formed subsequence” refers to the code units that were collected in this manner. They could be the start of a well-formed sequence, except that the sequence lacks the proper continuation. Alternatively, the converter may have found a continuation code unit, which cannot be the start of a well-formed sequence.
>>>>

And the "in this manner" refers to:
>>>>
A sequence of code units will be processed up to the point where the sequence either can be unambiguously interpreted as a particular Unicode code point or where the converter recognizes that the code units collected so far constitute an ill-formed subsequence.
>>>>

So we have the same thing twice: Bail out as soon as something is ill-formed.
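
For what it's worth, here is a minimal sketch (mine, purely to make the procedure concrete, and not a reference implementation) of that collection loop, using the well-formed byte ranges of Table 3-7: collect code units only as long as they could still lead to a well-formed sequence, and emit a single U+FFFD as soon as that is no longer possible.

# A minimal sketch of the "maximal subpart" procedure, based on the
# well-formed byte ranges of Table 3-7. Illustration only.

REPLACEMENT = "\ufffd"

def second_byte_range(lead):
    # Allowed range for the byte immediately following each lead byte
    # (all later continuation bytes are 0x80..0xBF).
    if 0xC2 <= lead <= 0xDF or 0xE1 <= lead <= 0xEC or \
       0xEE <= lead <= 0xEF or 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)
    if lead == 0xED:
        return (0x80, 0x9F)
    if lead == 0xF0:
        return (0x90, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)
    return None          # C0, C1, F5..FF, or a lone continuation byte

def decode_with_maximal_subparts(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:                  # ASCII byte: well-formed
            out.append(chr(lead))
            i += 1
            continue
        allowed = second_byte_range(lead)
        if allowed is None:               # cannot start anything well-formed:
            out.append(REPLACEMENT)       # replace the next single code unit
            i += 1
            continue
        length = 2 if lead <= 0xDF else (3 if lead <= 0xEF else 4)
        # Collect code units while they could still be the start of a
        # well-formed sequence of the expected length.
        j = i + 1
        while j < i + length and j < len(data):
            lo, hi = allowed if j == i + 1 else (0x80, 0xBF)
            if not lo <= data[j] <= hi:
                break
            j += 1
        if j - i == length:               # unambiguously one code point
            out.append(data[i:j].decode("utf-8"))
        else:                             # maximal subpart data[i:j]
            out.append(REPLACEMENT)
        i = j
    return "".join(out)

# The six-byte sequence discussed above: six U+FFFDs, not one.
assert decode_with_maximal_subparts(b"\xfc\xbf\xbf\xbf\xbf\x80") == REPLACEMENT * 6
# An overlong <C0 AF>: two U+FFFDs.
assert decode_with_maximal_subparts(b"\xc0\xaf") == REPLACEMENT * 2
# A truncated but potentially valid start <F1 80 80>: a single U+FFFD.
assert decode_with_maximal_subparts(b"a\xf1\x80\x80b") == "a" + REPLACEMENT + "b"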


Perhaps my view is colored by the fact that I now maintain code that was
written to parse UTF-8 back when overlongs were still considered legal
input.

Thanks for providing this information. That's a lot more useful than "feels right", which was given as a reason on this list before.


An overlong was a single unit.  When they became illegal, the code
still considered them a single unit.

That's fine for your code. I might do the same (or not) if I were you, because one indeed never knows in which situation some code is used, and what repercussions a change might produce.

But the PRI, and the wording in Unicode 5.2, were created when overlongs, 5-byte and 6-byte sequences, surrogate pairs, and so on were already very clearly ill-formed. If these texts had intended to make an exception for any of these cases, that exception would clearly have had to be written into the actual text.

Saying something like "the text isn't clear because it says ill-formed, but maybe it means ill-formed not as defined at the time it was written, but as defined quite a few years before" just doesn't make sense to me at all.


I can understand how someone who comes along later could say C0 can't be
followed by any continuation character that doesn't yield an overlong,
therefore C0 is a maximal subsequence.

Yes indeed, because maximal subsequences are defined by reference to well-formed/ill-formed subsequences, and what's ill-formed is defined in the same standard at the same time.

There's nobody "coming along later". That kind of wording would be appropriate if the PRI and the recommendation in the standard had been made, say, in the 1990s, before the tightening of the UTF-8 definition. Then somebody could say that Unicode overlooked that, by changing the definition of well-formed UTF-8, it implicitly changed the recommendation for how to generate U+FFFDs.

But no such thing at all happened. The PRI was evaluated, and the recommendation included in the text of Unicode, in the context of the then-existing (and since then unchanged) definition of UTF-8.


But I assert that my interpretation is just as valid as that one.

Sorry, but it cannot be valid, because of the timing. The tightening of the UTF-8 definition happened years before the PRI.


And perhaps more so, because of historical precedent.

It's good to know that there are older implementations that behave differently. And I understand that in some cases, these might be reluctant to change. My comments, and Henri's, are very much motivated by the fact that we are reluctant to change our implementations.

It may be worth thinking about whether the Unicode standard should mention implementations like yours. But there should be no doubt that the PRI and Unicode 5.2 (and the current version of Unicode) are clear about what they recommend, and that this recommendation is based on the definition of UTF-8 in force at the time (and unchanged since), not on a historical definition of UTF-8.

Regards,   Martin.
