On 6/30/20 8:43 AM, Emily Bowman wrote:
> I completely agree with this, that UTF-8 has become the One True
> Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside
> of the Win32 API. Nearly all basic emoji can't be represented in UCS-2
> wchar_t, let alone composite emoji.
>
> So how to make that C-compatible? Make everything a void* and it just
> comes back with as many bytes as it gets?

Actually, in C you would tend to represent UTF-8 as a char* (or maybe an
unsigned char*) type. Straight 'ASCII' strings are already valid UTF-8,
and many of the string functions will actually work OK with UTF-8
strings. This was an intentional part of the design of UTF-8: every byte
of a multi-byte sequence has its high bit set, so anything looking for
specific character values will tend to 'just work', as long as those
values really represent a character. The code does need to take into
account that bytes != characters now. If you want to actually count how
many characters (code points) are in a string, or to split a string
without cutting a code point in half, you need to be aware of the
encoding, but a lot will still just work.
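
To make that concrete, here is a minimal sketch in C. It assumes the
input is already valid, NUL-terminated UTF-8, and the helper names
(utf8_is_continuation, utf8_count_code_points, utf8_safe_cut) are made
up for illustration; they are not part of any standard library. It
relies only on the fact that continuation bytes have the form 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if b is a UTF-8 continuation byte,
 * i.e. its top two bits are 10 (binary 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: count code points in a NUL-terminated string.
 * Assumes s is valid UTF-8; every byte that is not a continuation
 * byte starts exactly one code point. */
static size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (!utf8_is_continuation((unsigned char)*s))
            n++;
    }
    return n;
}

/* Hypothetical helper: return the largest byte offset <= max_bytes at
 * which s can be cut without splitting a code point. Assumes s is
 * valid UTF-8. */
static size_t utf8_safe_cut(const char *s, size_t max_bytes)
{
    size_t len;

    assert(s != NULL);
    len = strlen(s);
    if (max_bytes >= len)
        return len;
    /* Back up while the byte at the cut is a continuation byte. */
    while (max_bytes > 0 && utf8_is_continuation((unsigned char)s[max_bytes]))
        max_bytes--;
    return max_bytes;
}
```

For the string "h\xC3\xA9llo" (h, e-acute, l, l, o), strlen() reports 6
bytes while utf8_count_code_points() reports 5, and strchr(s, 'l') still
finds the first 'l' at byte offset 3 with no UTF-8 awareness at all.
Asking utf8_safe_cut() for a cut at byte 2 gives 1 instead, since byte 2
is the second half of the e-acute. Note that this counts code points,
not user-perceived characters; composite emoji span several code points.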

-- 
Richard Damon
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/YQWLGMN2M4JDVFSOYGFMOPUB7QAAWH2U/
Code of Conduct: http://python.org/psf/codeofconduct/
