On Wed, May 24, 2017 at 3:56 PM, Karl Williamson <pub...@khwilliamson.com> wrote:
> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among else, in major programming language and browsers) without
>> problems for quite some time.
>>
>
> Could you supply a reference to the PRI and its feedback?
>

http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)

> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>

You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.

> And I agree with that. And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input. An overlong was a single unit. When they became illegal, the code
> still considered them a single unit.
>

Right.

> I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>

Right.

> But I assert that my interpretation is just as valid as that one. And
> perhaps more so, because of historical precedent.
>

I agree.

markus
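
To make the two readings of "maximal subpart" discussed above concrete, here
is a minimal Python sketch. The names decode_strict_prefix and
decode_legacy_unit are hypothetical, and the "legacy unit" decoder is only an
assumption about how code that once accepted overlongs might behave; it is
not anyone's actual implementation. Both agree on input without non-shortest
forms and differ on overlongs such as C0 80.

```python
REPL = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def _second_byte_range(lead):
    """Allowed range for the byte right after `lead` in well-formed UTF-8."""
    if lead == 0xE0:
        return 0xA0, 0xBF  # excludes 3-byte overlongs
    if lead == 0xED:
        return 0x80, 0x9F  # excludes surrogates
    if lead == 0xF0:
        return 0x90, 0xBF  # excludes 4-byte overlongs
    if lead == 0xF4:
        return 0x80, 0x8F  # excludes values above U+10FFFF
    return 0x80, 0xBF


def decode_strict_prefix(data: bytes) -> str:
    """Reading 1: stop at the first byte that cannot continue a well-formed
    sequence; each such maximal prefix becomes one U+FFFD."""
    out, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:  # C0, C1, F5..FF and stray continuation bytes: one byte each
            out.append(REPL)
            i += 1
            continue
        lo, hi = _second_byte_range(b)
        j, k = i + 1, 0
        while k < need and j < n and lo <= data[j] <= hi:
            j += 1
            k += 1
            lo, hi = 0x80, 0xBF  # later trail bytes use the generic range
        out.append(data[i:j].decode("utf-8") if k == need else REPL)
        i = j
    return "".join(out)


def decode_legacy_unit(data: bytes) -> str:
    """Reading 2 (assumed legacy behaviour): the lead byte fixes the length
    of a structural unit; an ill-formed unit becomes one U+FFFD."""
    out, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF7:
            need = 3
        else:  # stray continuation bytes and F8..FF: one byte each
            out.append(REPL)
            i += 1
            continue
        j = i + 1
        # Take up to `need` continuation bytes, whatever value they encode.
        while j - i <= need and j < n and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            out.append(data[i:j].decode("utf-8"))
        except UnicodeDecodeError:  # overlong, surrogate, truncated, ...
            out.append(REPL)
        i = j
    return "".join(out)
```

With these definitions, C0 80 turns into two U+FFFD under the first reading
and one under the second, and E0 80 80 turns into three versus one.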