On 02.02.2021 00:33, Inada Naoki wrote:
> On Tue, Feb 2, 2021 at 12:43 AM M.-A. Lemburg <m...@egenix.com> wrote:
>>
>> Hi Inada-san,
>>
>> thank you for adding some comments, but they are not really capturing
>> what I think is missing:
>>
>> """
>> Removing these APIs removes ability to use codec without temporary Unicode.
>>
>> Codecs can not encode Unicode buffer directly without temporary Unicode
>> object since Python 3.3. All these APIs creates temporary Unicode object for
>> now. So removing them doesn't reduce any abilities.
>> """
>>
>> The point is that while the decoders allow going from a C object
>> to a Python object directly, we are missing a way to do the same
>> for the encoders, since the Python 3.3 change in the Unicode internals.
>>
>> At the very least, we should have such APIs for going from wchar_t*
>> to a Python object.
>
> We already have PyUnicode_FromWideChar(). So I assume you mean
> "wchar_t* to Python bytes object".

Yes, that's what I meant. Encoding from wchar_t* to a Python bytes
object. This is what the encoder APIs all implement. They have become
less efficient with Python 3.3, but this can be resolved, while at the
same time removing Py_UNICODE and replacing it with wchar_t in those
encoder APIs.

>>
>> The alternatives you provide all require creating an intermediate
>> Python object for this purpose. The APIs you want to remove do that
>> as well, but that's not the point. The point is to expose the codecs'
>> decode mechanism which is available in the C code, but currently
>> not exposed via C APIs, e.g. ucs4lib_utf8_encode().
>>
>> It would be breaking change, but those APIs in your list could
>> simply be changed from using Py_UNICODE to using wchar_t instead
>> and then interface directly to the internal functions we have for
>> the encoders.
>>
>
> OK, I see codecs.h has three encoders.
>
> * utf8_encode
> * utf16_encode
> * utf32_encode
>
> But there are 13 encoders in my PEP:
>
> PyUnicode_Encode()
> PyUnicode_EncodeASCII()
> PyUnicode_EncodeLatin1()
> PyUnicode_EncodeUTF7()
> PyUnicode_EncodeUTF8()
> PyUnicode_EncodeUTF16()
> PyUnicode_EncodeUTF32()
> PyUnicode_EncodeUnicodeEscape()
> PyUnicode_EncodeRawUnicodeEscape()
> PyUnicode_EncodeCharmap()
> PyUnicode_TranslateCharmap()
> PyUnicode_EncodeDecimal()
> PyUnicode_TransformDecimalToASCII()
>
> Do you want to keep all encoders? or 3 encoders?

We could keep all encoders, replacing Py_UNICODE with wchar_t in
the API. For the ones where we have separate implementations as
private functions, we can move back to direct encoding. For the others,
we can keep using the temporary Unicode object, or refactor the code to
expose the native encoders working directly on the internal buffers as
private functions and then use those in the same way for direct
encoding.

The Unicode API was meant and designed as a rich API, making it easy
to use and providing a complete set for extension writers and CPython
to use.
I believe we should keep it that way.

>> That would keep extensions working after a recompile, since
>> Py_UNICODE is already a typedef to wchar_t.
>>
>
> That idea is written in the PEP already.
> https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-wchar-t

Right and I think this is a more workable approach than removing
APIs.

BTW: I don't understand this comment:
"They are inefficient on platforms wchar_t* is UTF-16. It is because
built-in codecs supports only UCS-1, UCS-2, and UCS-4 input."

Windows is one such platform. Java (indirectly) is another. They both
store UTF-16LE in those arrays and Python's codecs handle this just
fine.

--
Marc-Andre Lemburg
eGenix.com
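
To make the "direct encoding" idea discussed above concrete, here is a minimal sketch in plain C of encoding a wchar_t buffer straight to UTF-8 bytes without building a temporary Unicode object. It is an illustration only: wchar_to_utf8() and WCHAR_ENCODE_ERROR are hypothetical names, not CPython APIs, and this is not the internal ucs4lib_utf8_encode(). The sketch writes into a caller-supplied buffer instead of returning a Python bytes object, and it assumes that surrogate pairs should be combined, which is one way a UTF-16 wchar_t array (as on Windows) could be handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical error marker for this sketch (not a CPython constant). */
#define WCHAR_ENCODE_ERROR ((size_t)-1)

/* Hypothetical helper (not a CPython API): encode `len` wchar_t units from
 * `src` as UTF-8 into `out`, which has room for `cap` bytes.
 * Returns the number of bytes written, or WCHAR_ENCODE_ERROR on a lone
 * surrogate, an out-of-range value, or insufficient output space. */
size_t
wchar_to_utf8(const wchar_t *src, size_t len, unsigned char *out, size_t cap)
{
    size_t n = 0;
    assert(src != NULL && out != NULL);

    for (size_t i = 0; i < len; i++) {
        uint32_t cp = (uint32_t)src[i];

        /* UTF-16 input: combine a high/low surrogate pair into one code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len) {
            uint32_t lo = (uint32_t)src[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i++;  /* the low surrogate has been consumed */
            }
        }
        /* Anything still in the surrogate range is unpaired: reject it. */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return WCHAR_ENCODE_ERROR;

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (cap - n < need)
            return WCHAR_ENCODE_ERROR;

        if (need == 1) {
            out[n++] = (unsigned char)cp;
        } else if (need == 2) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

With the public C API as it stands, the same result takes two steps, PyUnicode_FromWideChar() followed by PyUnicode_AsUTF8String(), and the first step creates the intermediate Unicode object that this thread is about.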