(If this is too late in the process, I understand.  I think the required
code changes will be small and localized.)

From the unicode-transition page:

> The strategy that I favor at the moment is to handle all string data
> injected into the system transparently: the actual bytes are unchanged, and
> unexpected UTF-8 bytes are decoded and marked as a U+DC80 - U+DCFF (low,
> trailing) UTF-16 surrogate pair half.


The trouble with this is that it means the internal representation is no
longer valid UTF-8, which may cause problems down the line, since it is
exposed to anyone dealing with bytevectors.

There is an alternative based on the little-known "noncharacter" range.
Despite the name, these really are perfectly valid characters, but Unicode
guarantees that they will never be assigned to anything in the Real World
and are reserved for internal use.[1]  I propose using them instead of the
surrogate space.  Unfortunately there aren't enough of them to assign one
to each possible stray byte, but we can assign one to each high and low
nybble of each stray byte, analogously to the way Planes 1 to 16 are
handled in UTF-16.

Specifically, given a stray byte whose hex representation is xy, we decode
it as the UTF-8 equivalent of U+FDDx U+FDEy, which is EF B7 9x EF B7 Ay in
the internal encoding, which is now valid UTF-8.  If any of these
noncharacters (coming from a UTF-8 or UTF-16 source) is to be decoded, we
escape it with the UTF-8 representation of U+FFFE, which is EF BF BE, so
that (say) external U+FDDA is decoded as EF BF BE EF B7 9A.  U+FFFE is also
used to escape itself, so it becomes EF BF BE EF BF BE internally.
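To make the arithmetic concrete, here is a minimal Python sketch of the
scheme (my own illustration; the function names are invented, not part of
any proposed API):

```python
def encode_stray(byte):
    """Encode a stray byte 0xXY as the UTF-8 bytes of U+FDDX U+FDEY."""
    hi, lo = byte >> 4, byte & 0x0F
    return (chr(0xFDD0 + hi) + chr(0xFDE0 + lo)).encode('utf-8')

def decode_stray(pair):
    """Recover the original stray byte from a U+FDDx U+FDEy pair."""
    s = pair.decode('utf-8')
    hi, lo = ord(s[0]) - 0xFDD0, ord(s[1]) - 0xFDE0
    return (hi << 4) | lo

def escape_noncharacter(cp):
    """Escape an externally supplied noncharacter (or U+FFFE itself)
    by prefixing it with the UTF-8 bytes of U+FFFE."""
    return ('\uFFFE' + chr(cp)).encode('utf-8')

# Stray byte 0xAB becomes U+FDDA U+FDEB, i.e. EF B7 9A EF B7 AB:
print(encode_stray(0xAB).hex(' '))           # ef b7 9a ef b7 ab
print(decode_stray(encode_stray(0xAB)))      # 171 (= 0xAB)

# External U+FDDA becomes EF BF BE EF B7 9A; U+FFFE escapes itself:
print(escape_noncharacter(0xFDDA).hex(' '))  # ef bf be ef b7 9a
print(escape_noncharacter(0xFFFE).hex(' '))  # ef bf be ef bf be
```

Note that the output of encode_stray() is always well-formed UTF-8, which
is the whole point of preferring noncharacters over lone surrogate halves.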

I hope this is understandable.

[1] See https://www.unicode.org/versions/corrigendum9.html for details.
