[Python-Dev] Re: PEP 624: Remove Py_UNICODE encoder APIs

M.-A. Lemburg Mon, 01 Feb 2021 08:41:07 -0800

On 01.02.2021 17:10, Victor Stinner wrote:
> On Mon, Feb 1, 2021 at 4:47 PM M.-A. Lemburg <m...@egenix.com> wrote:
>> At the very least, we should have such APIs for going from wchar_t*
>> to a Python object.
>>
>> The alternatives you provide all require creating an intermediate
>> Python object for this purpose.
> 
> We cannot optimize all use cases. IMO we should only optimize
> conversions between char* and Python object.
> 
> I don't see the need for two conversions (char* => Python and then
> Python => wchar_t*) as an issue if you need wchar_t*.


The C code is already there, but it got hidden away in the
Python 3.3 change to new internals.

All that needs to be done is to remove the intermediate Python
Unicode object creation and have those encoder APIs interface
directly to the native C code again.

> Objects/unicodeobject.c is already very complex with specialization
> for ASCII, Py_UCS1 (latin1), Py_UCS2 and Py_UCS4 kinds: 16k lines of C
> code. I would prefer to make it simpler than more complex.
> 
> Internally, functions like PyUnicode_EncodeLatin1() already do the two
> conversions. So it's not like the PEP has any impact on performance.

Before Python 3.3, all those APIs interfaced directly to the
C codec functions. The intermediate Python Unicode object was
introduced only as a quick workaround, even though it was not
really needed, since Python 3.3 did not remove the C code of
the encoders.
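
To make "interfacing directly to the C codec" concrete, here is a
minimal standalone sketch of a Latin-1 encoder that works on a
wchar_t* buffer without creating any intermediate object. It is not
CPython's implementation; the function name and signature are
hypothetical, and error handling is reduced to a return code.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not part of the CPython API: encode `len` wide
   characters from `src` as Latin-1 into `dst`, which must have room for
   at least `len` bytes. Returns 0 on success, or -1 if a code point is
   above U+00FF and so has no Latin-1 representation. */
int
encode_latin1_direct(const wchar_t *src, size_t len, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        /* The cast also rejects negative values on platforms where
           wchar_t is a signed type. */
        if ((unsigned long)src[i] > 0xFFUL) {
            return -1;  /* a real codec would consult an error handler */
        }
        dst[i] = (unsigned char)src[i];
    }
    return 0;
}
```

The point of the sketch is only that the encoding loop reads the
wchar_t data in place; wrapping the output buffer in a bytes object
would be the single allocation left.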

>> That would keep extensions working after a recompile, since
>> Py_UNICODE is already a typedef to wchar_t.
> 
> Extensions should not use Py_UNICODE*/wchar_t*.

They should not use Py_UNICODE.

wchar_t is standard C and is in widespread use in C code for
storing Unicode data. This was one of the main reasons for
introducing UCS4 Python versions for Linux in the mid-2000s,
since Linux apps used 4-byte wchar_t as their native storage format.

My point is that extensions would just need a recompile
with the change from Py_UNICODE to wchar_t, since Py_UNICODE
and wchar_t are already the same thing in Python 3.3+.
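
A small sketch of why a recompile would be enough: when one name is
merely a typedef of wchar_t, code written against that name accepts
wchar_t* data with no cast or copy. The names below are hypothetical
stand-ins so that the example does not depend on Python.h.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a typedef of the form
   `typedef wchar_t Py_UNICODE;`. */
typedef wchar_t ext_unicode_t;

/* A function an extension might have written against the old name. */
size_t
ext_length(const ext_unicode_t *s)
{
    return wcslen(s);
}

/* A caller holding plain wchar_t* data: no cast is needed, because
   both names denote the same type. */
size_t
length_of_wide(const wchar_t *w)
{
    return ext_length(w);
}
```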

> Can you explain where wchar_t* type is appropriate and how two
> conversions is a performance bottleneck?

If an extension has a wchar_t* string, it should be easy
to convert this into a Python bytes object for use in Python.

Just like it should be easy to go from a char* string to
a Python str object.

The PEP breaks this symmetry by removing access to the
encoder implementations.

-- 
Marc-Andre Lemburg
eGenix.com

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/FSUPT6B26VJT7S6UCW4RYWRQ3LYLUINU/