> > Citing directly from the PRI:
> >
> >>>>
> The term "maximal subpart of the ill-formed subsequence" refers to the
> longest potentially valid initial subsequence or, if none, then to the next
> single code unit.
> >>>>
>
The way I understand it, C0 80 has TWO maximal subparts. C0 can never begin a valid sequence, so there is no potentially valid initial subsequence and only the next single code unit (C0) is considered. The following byte 80 is a lone trail byte, so again there is no potentially valid initial subsequence and only that single code unit (80) is considered. U+FFFD is therefore emitted twice. This handles all the "overlong" sequences that were possible under the old UTF-8 definition in the first RFC.

For E1 80 20, there is only ONE maximal subpart: E1 80 is a valid initial subsequence of a three-byte sequence, so a single U+FFFD is emitted, followed by the valid UTF-8 sequence 20, which decodes correctly as U+0020.

Good! This means that the proposal makes sense and is compatible with random access within the encoded text: there is no need to look backward over an indefinite number of code units, and we never have to handle a case where a possibly unbounded number of code units maps to the same U+FFFD replacement.
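To make my reading of the rule concrete, here is a minimal sketch in Python. It is only an illustration, not anything taken from the PRI: the function names are hypothetical, and the byte ranges are the ones I take from the well-formed UTF-8 table in the Unicode Standard. Each maximal subpart (a truncated but potentially valid prefix, or else a single code unit) becomes exactly one U+FFFD, and the decoder only ever looks forward by at most three bytes.

```python
def _lead_info(lead):
    """Return (trail_count, lo, hi) for a lead byte, where lo..hi is the
    allowed range of the FIRST trail byte, or None if the byte can never
    start a multi-byte sequence (80..BF, C0, C1, F5..FF)."""
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if 0xE0 <= lead <= 0xEF:
        lo = 0xA0 if lead == 0xE0 else 0x80   # E0 80..9F would be overlong
        hi = 0x9F if lead == 0xED else 0xBF   # ED A0..BF would be a surrogate
        return 2, lo, hi
    if 0xF0 <= lead <= 0xF4:
        lo = 0x90 if lead == 0xF0 else 0x80   # F0 80..8F would be overlong
        hi = 0x8F if lead == 0xF4 else 0xBF   # F4 90..BF would exceed U+10FFFF
        return 3, lo, hi
    return None


def decode_utf8_maximal_subparts(data):
    """Decode bytes as UTF-8, emitting one U+FFFD per maximal subpart
    of each ill-formed subsequence."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                        # ASCII: always well-formed
            out.append(chr(lead))
            i += 1
            continue
        info = _lead_info(lead)
        if info is None:                       # no valid initial subsequence:
            out.append("\ufffd")               # replace this single code unit
            i += 1
            continue
        trail_count, lo, hi = info
        cp = lead & (0x3F >> trail_count)      # payload bits of the lead byte
        j = i + 1
        for k in range(trail_count):
            if j >= n:
                break                          # input ends mid-sequence
            low, high = (lo, hi) if k == 0 else (0x80, 0xBF)
            if not (low <= data[j] <= high):
                break                          # data[j] is not part of this subpart
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j - i == trail_count + 1:
            out.append(chr(cp))                # complete, well-formed sequence
        else:
            out.append("\ufffd")               # data[i:j] is one maximal subpart
        i = j                                  # resume at the first unconsumed byte
    return "".join(out)
```

With this sketch, `decode_utf8_maximal_subparts(b"\xc0\x80")` returns two U+FFFD, while `decode_utf8_maximal_subparts(b"\xe1\x80\x20")` returns one U+FFFD followed by U+0020. The byte that ends a truncated sequence is never swallowed; it is re-examined as the start of the next sequence.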