On 6/30/20 8:43 AM, Emily Bowman wrote:
> I completely agree with this, that UTF-8 has become the One True
> Encoding(tm), and UCS-2 and UTF-16 are hardly found anywhere outside
> of the Win32 API. Nearly all basic emoji can't be represented in UCS-2
> wchar_t, let alone composite emoji.
>
> So how to make that C-compatible? Make everything a void* and it just
> comes back with as many bytes as it gets?

Actually, in C you would tend to represent UTF-8 as a char* (or maybe an unsigned char*). Note that plain ASCII strings are also valid UTF-8, and that many of the standard string functions work fine on UTF-8 strings. This was an intentional part of the design of UTF-8: every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII character. Anything looking for specific character values will tend to 'just work', as long as those values really represent a character.

The code does need to take into account that bytes != characters. If you want to count how many characters are in a string, you have to count code points rather than bytes, and you must avoid splitting a string in the middle of a code point. But a lot will still just work.

-- 
Richard Damon
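
To make the two caveats concrete, here is a minimal sketch in C. It assumes the input is valid, NUL-terminated UTF-8 held in a plain char*; the names utf8_count and utf8_safe_cut are made up for illustration and do not come from any library. It relies only on the fact that continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: count code points in a NUL-terminated string.
 * Assumes the input is valid UTF-8; no validation is performed.
 * Every code point has exactly one non-continuation (lead) byte,
 * so counting those bytes counts code points. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (!utf8_is_continuation((unsigned char)*s))
            ++n;
    }
    return n;
}

/* Hypothetical helper: return the largest byte offset <= max_bytes
 * at which the string can be cut without splitting a code point.
 * Assumes the input is valid UTF-8. */
size_t utf8_safe_cut(const char *s, size_t max_bytes)
{
    size_t len, cut;
    assert(s != NULL);
    len = strlen(s);
    if (max_bytes >= len)
        return len;
    cut = max_bytes;
    /* Back up while the byte at the cut is a continuation byte. */
    while (cut > 0 && utf8_is_continuation((unsigned char)s[cut]))
        --cut;
    return cut;
}
```

For the string "h\xC3\xA9llo" (the word "héllo", where é is encoded as two bytes), strlen reports 6 bytes while utf8_count reports 5 code points, and strchr still finds the first 'l' at byte offset 3 without any UTF-8 awareness. Asking utf8_safe_cut for at most 2 bytes gives 1, since cutting at offset 2 would land inside the é.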