On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode <unicode@unicode.org> wrote:
>
> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
> >
> > On Tue, 11 Sep 2018 21:10:03 +0200
> > Hans Åberg via Unicode <unicode@unicode.org> wrote:
> >
> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> >> LaTeX files with sections in different Cyrillic and Latin encodings,
> >> changing the editor encoding while typing.
> >
> > Rather like some of the old Unicode list archives, which are just
> > concatenations of a month's emails, with all sorts of 8-bit encodings
> > and stretches of base64.
>
> It might be useful to represent non-UTF-8 bytes as Unicode code points.
> One way might be to use a code point to indicate "high bit set," followed
> by the byte value with its high bit set to 0, that is, truncated into the
> ASCII range. For example, U+0080 looks like it is not in use, though I
> could not verify this.

That byte is in use: 0x80 appears as a continuation byte, for instance in
the encoding of U+0400 (0xD0 0x80) and of U+8000 (0xE8 0x80 0x80).

Standard UTF-8's four-byte form can structurally represent every value
from 0 to 0x1FFFFF, which covers all defined code points (U+0000 through
U+10FFFF); early variants could support values up to U+7FFFFFFF with
five- and six-byte sequences, and there are enough bits to carry the
pattern forward to 36 or 42 bits (the last one breaking the standard a
bit by allowing a lead byte without any bit off: 0xFF would be the
lead-in). The byte values 0xF8-0xFF are unused in standard UTF-8, but
each of them can itself be encoded into UTF-8 as a code point.
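For what it's worth, a quick check with Python's built-in codec (just an illustration, not part of the proposal) shows the 0x80 byte already in use as a continuation byte, so it could not serve as an unescaped "high bit set" marker:

```python
# U+0400 (CYRILLIC CAPITAL LETTER IE WITH GRAVE block start) and U+8000
# both contain the raw byte 0x80 in their UTF-8 encodings.
print("\u0400".encode("utf-8").hex())  # d080
print("\u8000".encode("utf-8").hex())  # e88080
```

The same continuation byte turns up in countless other code points; any in-band marker scheme would need to escape it as well.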
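The lead-byte pattern described above can be sketched as a generalized encoder. This is a hypothetical illustration (`encode_ext_utf8` is my name for it): everything past four bytes is outside the standard, and the 8-byte/42-bit form uses the nonconforming 0xFF lead byte mentioned above.

```python
def encode_ext_utf8(n: int) -> bytes:
    """Encode a non-negative integer using the UTF-8 bit pattern,
    extended past the standard's 4-byte limit (a sketch, not standard
    UTF-8 beyond U+10FFFF)."""
    if n < 0x80:
        return bytes([n])            # 1-byte ASCII form
    for k in range(2, 9):            # k = total bytes in the sequence
        lead_payload = max(7 - k, 0)             # payload bits in lead byte
        total_bits = lead_payload + 6 * (k - 1)  # plus 6 per continuation
        if n < (1 << total_bits):
            cont = []
            for _ in range(k - 1):
                cont.append(0x80 | (n & 0x3F))   # 10xxxxxx continuation
                n >>= 6
            if k <= 6:
                # k leading one-bits, then a zero, then the top payload bits
                lead = ((0xFF << (8 - k)) & 0xFF) | n
            else:
                # 7 bytes -> 0xFE lead (36 bits); 8 bytes -> 0xFF (42 bits),
                # which breaks the "lead byte has a zero bit" rule
                lead = 0xFE if k == 7 else 0xFF
            return bytes([lead] + cont[::-1])
    raise ValueError("value exceeds 42 bits")
```

For values through U+10FFFF this reproduces ordinary UTF-8 (e.g. 0x400 yields 0xD0 0x80), and the all-ones 42-bit value comes out as 0xFF followed by seven 0xBF bytes.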