On 18 May 2017, at 06:01, Richard Wordingham via Unicode <unicode@unicode.org> wrote:
>
> On Thu, 18 May 2017 02:04:55 +0200
> Philippe Verdy via Unicode <unicode@unicode.org> wrote:
>
>> I find intriguating that the update intends to enforce the decoding
>> of the **shortest** sequences, but now wants to treat **maximal
>> sequences** as a single unit with arbitrary length. UTF-8 was
>> designed to work only with some state machines that would NEVER need
>> to parse more than 4 bytes.
>
> If you look at the sample code in
> http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that
> it's working with 6-byte sequences. It's the Unicode, as opposed to
> ISO 10646, version that has always been restricted to 4 bytes.

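To make the 4-byte versus 6-byte difference concrete, here is a rough sketch that counts which byte values can never start a sequence under each length limit. It is not taken from the Unicode 2.0 sample code; the names `leading_ones`, `too_long_lead_bytes` and `UNICODE_NEVER_APPEAR` are made up for illustration, and the first two look only at the bit pattern of the byte.

```python
def leading_ones(byte: int) -> int:
    """Count the 1 bits at the top of an 8-bit value (0 for ASCII bytes)."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def too_long_lead_bytes(max_len: int) -> list[int]:
    """Byte values whose bit pattern implies a sequence longer than max_len.

    Hypothetical helper for illustration only: it ignores overlong forms
    and the U+10FFFF ceiling, and checks nothing but the lead-byte pattern.
    """
    return [b for b in range(0x100) if leading_ones(b) > max_len]


# Unicode's definition excludes more than the bit pattern alone does:
# 0xC0 and 0xC1 could only begin overlong forms, 0xF5..0xF7 could only
# begin code points above U+10FFFF, and 0xF8..0xFF would begin sequences
# longer than four bytes.
UNICODE_NEVER_APPEAR = sorted({0xC0, 0xC1} | set(range(0xF5, 0x100)))

if __name__ == "__main__":
    print(len(too_long_lead_bytes(6)))  # 0xFE and 0xFF only
    print(len(too_long_lead_bytes(4)))  # 0xF8 through 0xFF
    print(len(UNICODE_NEVER_APPEAR))
```

With a 6-byte limit only two byte values are ruled out by pattern; with a 4-byte limit it is eight, and thirteen once the overlong and out-of-range lead bytes are excluded as well.
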
There are good reasons for restricting it to four-byte sequences, mind; doing so increases the number of byte values that can never occur in valid UTF-8, which makes it easier to tell UTF-8 apart from data that isn't UTF-8. I don’t think anyone is proposing allowing 5-byte or 6-byte sequences.

Kind regards,

Alastair.

--
http://alastairs-place.net