I looked into this extensively when I was working on CFString, and came to
the conclusion that this was probably the path of least resistance.

But just to clarify, the Unicode situation is even more complicated than
that. Surrogate code points are reserved and are not allowed in UTF-8. So
to find the length of a UTF-8 string in the UTF-16 encoding you have to
decode the entire string to code points (effectively UTF-32), check that
none of them are surrogates (those should be treated as illegal), and count
any code point above 0xFFFF as two UTF-16 code units, since it has to be
encoded as a surrogate pair.
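To make that concrete, here is a minimal sketch in plain C (not the actual
CFString code; the function name and error convention are just placeholders)
of counting UTF-16 code units while validating a UTF-8 buffer:

#include <stddef.h>
#include <stdint.h>

/* Returns the number of UTF-16 code units needed to represent the UTF-8
 * input, or (size_t)-1 if the input is not valid UTF-8 (truncated sequence,
 * overlong form, surrogate code point, or value above 0x10FFFF). */
static size_t
utf8_to_utf16_length(const uint8_t *s, size_t len)
{
  size_t i = 0, units = 0;

  while (i < len)
    {
      uint32_t cp;
      size_t   n;

      if (s[i] < 0x80)                { cp = s[i];        n = 1; }
      else if ((s[i] & 0xe0) == 0xc0) { cp = s[i] & 0x1f; n = 2; }
      else if ((s[i] & 0xf0) == 0xe0) { cp = s[i] & 0x0f; n = 3; }
      else if ((s[i] & 0xf8) == 0xf0) { cp = s[i] & 0x07; n = 4; }
      else return (size_t)-1;             /* invalid lead byte */

      if (i + n > len) return (size_t)-1; /* truncated sequence */
      for (size_t j = 1; j < n; j++)
        {
          if ((s[i + j] & 0xc0) != 0x80) return (size_t)-1; /* bad trail byte */
          cp = (cp << 6) | (s[i + j] & 0x3f);
        }

      /* Reject overlong encodings, surrogates and out-of-range values. */
      if ((n == 2 && cp < 0x80) || (n == 3 && cp < 0x800)
          || (n == 4 && cp < 0x10000))
        return (size_t)-1;
      if (cp >= 0xd800 && cp <= 0xdfff) return (size_t)-1;
      if (cp > 0x10ffff) return (size_t)-1;

      units += (cp > 0xffff) ? 2 : 1;     /* supplementary plane => pair */
      i += n;
    }
  return units;
}

So there is no shortcut: you have to walk every byte and validate as you go.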

Regardless of what is done, you'll end up in a delicate situation. The UTF
encodings must be constantly error checked, because there's always a chance
that all this back-and-forth conversion can introduce an invalid character.

On top of it all, UTF-16 surrogate pairs can only encode code points up to
0x10FFFF, which means that if/when this limit is reached, a new encoding
will have to be devised.
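For reference, that limit falls straight out of the surrogate-pair
arithmetic itself; a short sketch (hypothetical function name, standard
encoding formula):

#include <stdint.h>

/* Encode a supplementary-plane code point (0x10000..0x10FFFF) as a UTF-16
 * surrogate pair.  The pair carries 10 + 10 = 20 bits on top of 0x10000,
 * which is exactly why the ceiling is 0x10000 + 0xFFFFF = 0x10FFFF. */
static void
utf16_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
  uint32_t u = cp - 0x10000;                /* 20 significant bits */

  *high = (uint16_t)(0xd800 + (u >> 10));   /* leading (high) surrogate */
  *low  = (uint16_t)(0xdc00 + (u & 0x3ff)); /* trailing (low) surrogate */
}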

On Sat, Apr 7, 2018, 04:49 David Chisnall <gnus...@theravensnest.org> wrote:

> On 5 Apr 2018, at 20:09, Stefan Bidigaray <stefanb...@gmail.com> wrote:
> >
> > I know this is probably going to be rejected, but how about making
> constant string either ASCII or UTF-16 only? Scratching UTF-8 altogether? I
> know this would increase the byte count for most European languages using
> Latin characters, but I don't see the point of maintaining both UTF-8 and
> UTF-16 encoding. Everything that can be done with UTF-16 can be encoded in
> UTF-8 (and vice versa), so how would the compiler pick between the two?
> Additionally, wouldn't sticking to just 1 of the 2 encoding simplify the
> code significantly?
>
> I am leaning in this direction.  The APIs all want UTF-16 codepoints.  In
> ASCII, each character is precisely one UTF-16 codepoint.  In UTF-16, every
> two-byte value is a UTF-16 codepoint.  In UTF-8, UTF-16 codepoints are
> somewhere between 1 and 3 characters long and the mapping is complicated.
> It’s a shame that in the 64-bit transition Apple didn’t make unichar 32
> bits and make it a unicode character, so we’re stuck in the same situation
> of Windows with a hasty s/UCS2/UTF-16/ and an attempt to make the APIs keep
> working.
>
> My current plan is to make the format support ASCII, UTF-8, UTF-16, and
> UTF-32, but only generate ASCII and UTF-16 in the compiler and then decide
> later if we want to support generating UTF-8 and UTF-32.  I also won’t
> initialise the hash in the compiler initially, until we’ve decided a bit
> more what the hash should be.
>
> David
>
>