[Python-Dev] Re: PEP 624: Remove Py_UNICODE encoder APIs

M.-A. Lemburg Tue, 02 Feb 2021 02:39:32 -0800

On 02.02.2021 00:33, Inada Naoki wrote:
> On Tue, Feb 2, 2021 at 12:43 AM M.-A. Lemburg <[email protected]> wrote:
>>
>> Hi Inada-san,
>>
>> thank you for adding some comments, but they are not really capturing
>> what I think is missing:
>>
>> """
>> Removing these APIs removes ability to use codec without temporary Unicode.
>>
>>     Codecs can not encode Unicode buffer directly without temporary Unicode
>> object since Python 3.3. All these APIs creates temporary Unicode object for
>> now. So removing them doesn't reduce any abilities.
>> """
>>
>> The point is that while the decoders allow going from a C object
>> to a Python object directly, we are missing a way to do the same
>> for the encoders, since the Python 3.3 change in the Unicode internals.
>>
>> At the very least, we should have such APIs for going from wchar_t*
>> to a Python object.
> 
> We already have PyUnicode_FromWideChar(). So I assume you mean
> "wchar_t* to Python bytes object".


Yes, that's what I meant. Encoding from wchar_t* to a Python bytes
object. This is what the encoder APIs all implement. They have become
less efficient with Python 3.3, but this can be resolved, while
at the same time removing Py_UNICODE and replacing it with wchar_t
in those encoder APIs.

>>
>> The alternatives you provide all require creating an intermediate
>> Python object for this purpose. The APIs you want to remove do that
>> as well, but that's not the point. The point is to expose the codecs'
>> encode mechanism which is available in the C code, but currently
>> not exposed via C APIs, e.g. ucs4lib_utf8_encode().
>>
>> It would be a breaking change, but those APIs in your list could
>> simply be changed from using Py_UNICODE to using wchar_t instead
>> and then interface directly to the internal functions we have for
>> the encoders.
>>
> 
> OK, I see codecs.h has three encoders.
> 
> * utf8_encode
> * utf16_encode
> * utf32_encode
>
> But there are 13 encoders in my PEP:
> 
> PyUnicode_Encode()
> PyUnicode_EncodeASCII()
> PyUnicode_EncodeLatin1()
> PyUnicode_EncodeUTF7()
> PyUnicode_EncodeUTF8()
> PyUnicode_EncodeUTF16()
> PyUnicode_EncodeUTF32()
> PyUnicode_EncodeUnicodeEscape()
> PyUnicode_EncodeRawUnicodeEscape()
> PyUnicode_EncodeCharmap()
> PyUnicode_TranslateCharmap()
> PyUnicode_EncodeDecimal()
> PyUnicode_TransformDecimalToASCII()
> 
> Do you want to keep all encoders? or 3 encoders?

We could keep all encoders, replacing Py_UNICODE with wchar_t
in the API.

For the ones where we have separate implementations
as private functions, we can move back to direct encoding.

For the others, we can keep using the temporary Unicode object
or refactor the code to expose the native encoders working
directly on the internal buffers as private functions
and then use those in the same way for direct encoding.
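
To make "direct encoding" concrete, here is a minimal sketch in plain C
of going from a wchar_t buffer to UTF-8 bytes without any intermediate
object. The name sketch_wchar_to_utf8 and the error constant are
hypothetical, not CPython APIs. The sketch writes into a caller-supplied
buffer instead of creating a Python bytes object, and it is not meant to
mirror how the internal encoders are written.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical error marker for this sketch only. */
#define SKETCH_ENCODE_ERROR ((size_t)-1)

/* Hypothetical helper, not a CPython API: encode `len` wchar_t units as
   UTF-8 into `out` (capacity `cap` bytes). Returns the number of bytes
   written, or SKETCH_ENCODE_ERROR if the buffer is too small or the input
   contains a lone surrogate or a value outside the Unicode range. */
size_t
sketch_wchar_to_utf8(const wchar_t *src, size_t len,
                     unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = sizeof(wchar_t) == 2 ? (uint16_t)src[i]
                                           : (uint32_t)src[i];

        /* Where wchar_t is 16 bits wide the buffer holds UTF-16, so a high
           surrogate followed by a low surrogate forms one code point. */
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF
            && i + 1 < len) {
            uint32_t low = (uint16_t)src[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i++;
            }
        }

        /* Reject lone surrogates and values outside the Unicode range. */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return SKETCH_ENCODE_ERROR;

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (need > cap - n)
            return SKETCH_ENCODE_ERROR;

        if (need == 1) {
            out[n++] = (unsigned char)cp;
        } else if (need == 2) {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[n++] = (unsigned char)(0xE0 | (cp >> 12));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (cp >> 18));
            out[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return n;
}
```

A real replacement API would additionally have to size and allocate the
bytes object and honour the codec error handlers, which this sketch
leaves out.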

The Unicode API was meant and designed as a rich API, making
it easy to use and providing a complete set for extension
writers and CPython to use. I believe we should keep it that
way.

>> That would keep extensions working after a recompile, since
>> Py_UNICODE is already a typedef to wchar_t.
>>
> 
> That idea is written in the PEP already.
> https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-wchar-t

Right, and I think this is a more workable approach than removing
the APIs.

BTW: I don't understand this comment:
"They are inefficient on platforms wchar_t* is UTF-16. It is because
built-in codecs supports only UCS-1, UCS-2, and UCS-4 input."

Windows is one such platform. Java (indirectly) is another. They both
store UTF-16LE in those arrays and Python's codecs handle this just
fine.
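
For readers less familiar with the UCS-2 versus UTF-16 distinction in
that PEP comment: in UTF-16 a code point above U+FFFF occupies two
16-bit units (a surrogate pair), which a pure UCS-2 reader would treat
as two separate characters. A minimal sketch of joining such pairs,
using the hypothetical helper name sketch_utf16_to_ucs4 (not a CPython
function), might look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand UTF-16 code units into UCS-4 code points.
   `out` must have room for `len` entries. Returns the number of code
   points written. Unpaired surrogates are copied through unchanged. */
size_t
sketch_utf16_to_ucs4(const uint16_t *src, size_t len, uint32_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = src[i];
        /* A high surrogate followed by a low surrogate is one code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10)
                 + ((uint32_t)src[i + 1] - 0xDC00);
            i++;
        }
        out[n++] = cp;
    }
    return n;
}
```

With the three units 0x0041, 0xD83D, 0xDE00 as input, this yields the
two code points U+0041 and U+1F600.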

-- 
Marc-Andre Lemburg
eGenix.com
