[Python-Dev] Re: Plan to remove Py_UNICODE APIs except PEP 623.

Victor Stinner Tue, 30 Jun 2020 06:27:59 -0700

On Tue, 30 Jun 2020 at 13:53, M.-A. Lemburg <m...@egenix.com> wrote:
> > I would prefer to analyze the list on a case by case basis. I don't
> > think that it's useful to expose every single encoding supported by
> > Python as a C function.
>
> (...)
> This does not mean we have to give up the symmetry in the C API,
> or that the encoding APIs are now suddenly useless. It only means
> that we have to replace Py_UNICODE with one of the supported data
> types for storing Unicode.


Let's agree to disagree :-)

I don't think that completeness is a good rationale for designing the
C API.

The C API is too large; we have to make it smaller. A specialized
function, like PyUnicode_AsUTF8String(), can be justified for different
reasons:

* It is a very common use case and so it helps to write C extensions
* It is significantly faster than the alternative generic function

In C, you can execute arbitrary Python code by calling methods on
Python str objects. For example, "abc".encode("utf-8",
"surrogateescape") in Python becomes PyObject_CallMethod(obj,
"encode", "ss", "utf-8", "surrogateescape") in C. Besides, there is
already a more specialized, yet still generic,
PyUnicode_AsEncodedObject() function.
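
To make the comparison concrete, here is a minimal sketch, written in
Python rather than C so that it runs without compiling an extension. It
assumes CPython, because it uses ctypes.pythonapi to call the generic C
function PyUnicode_AsEncodedString() directly, and it checks that the
result matches the str.encode() method. The helper name
encode_via_c_api is made up for this illustration.

```python
import ctypes

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
api = ctypes.pythonapi

# PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
#                                     const char *encoding,
#                                     const char *errors)
api.PyUnicode_AsEncodedString.argtypes = [
    ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p,
]
api.PyUnicode_AsEncodedString.restype = ctypes.py_object


def encode_via_c_api(text, encoding="utf-8", errors="strict"):
    """Hypothetical helper: encode a str through the generic C function."""
    return api.PyUnicode_AsEncodedString(
        text, encoding.encode("ascii"), errors.encode("ascii")
    )


# A lone surrogate, as produced when decoding undecodable bytes
# with the "surrogateescape" error handler.
text = "abc\udc80"

via_method = text.encode("utf-8", "surrogateescape")
via_c_api = encode_via_c_api(text, "utf-8", "surrogateescape")
print(via_method == via_c_api)
```

In a C extension, the same call is a single line and needs no function
dedicated to one encoding.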

We must not add a C API function for every single Python feature,
otherwise it would be too expensive to maintain, and it would become
impossible for other Python implementations to implement the full C
API. Even today, PyPy implements only a small subset of the C API.


> Since the C world has adopted wchar_t for this purpose, it's the
> natural choice.

In my experience, in C extensions, there are two kinds of data:

* bytes are used as a "char*": an array of bytes
* Unicode is used as a Python object

For the very rare cases involving wchar_t*, PyUnicode_FromWideChar()
can be used. I don't think that performance justifies duplicating
each function, once for a Python str object and once for wchar_t*. I
have mostly seen code involving wchar_t* used to initialize Python. But
this code was wrong since it used PyUnicode functions *before* Python
was initialized. That's bad and can now crash in recent Python versions.
The new PEP 587 API has a different design and avoids Python objects
and anything related to the Python runtime:
https://docs.python.org/dev/c-api/init_config.html#c.PyConfig_SetString

Moreover, CPython implements functions taking a wchar_t* string by
calling PyUnicode_FromWideChar() internally...
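
As a rough illustration of that two-step route, the sketch below turns
a wchar_t buffer into a str with PyUnicode_FromWideChar() and then
encodes it with the generic PyUnicode_AsEncodedString(). It is Python
using ctypes.pythonapi (so it assumes CPython), not code from CPython
itself, and the helper name encode_wide is made up for this example.

```python
import ctypes

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
api = ctypes.pythonapi

# PyObject *PyUnicode_FromWideChar(const wchar_t *wstr, Py_ssize_t size)
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object

# PyObject *PyUnicode_AsEncodedString(PyObject *unicode,
#                                     const char *encoding,
#                                     const char *errors)
api.PyUnicode_AsEncodedString.argtypes = [
    ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p,
]
api.PyUnicode_AsEncodedString.restype = ctypes.py_object


def encode_wide(wide_buffer, encoding=b"utf-8", errors=b"strict"):
    """Hypothetical helper: wchar_t* -> str -> bytes, no wchar_t codec."""
    # A size of -1 asks the function to find the length with wcslen().
    text = api.PyUnicode_FromWideChar(wide_buffer, -1)
    return api.PyUnicode_AsEncodedString(text, encoding, errors)


# A NUL-terminated wchar_t[] array, standing in for data from a C library.
wide = ctypes.create_unicode_buffer("h\u00e9llo")
print(encode_wide(wide))
```

The cost of the extra step is one temporary str object, which is the
price that functions taking wchar_t* already pay inside CPython.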


> PyUnicode_AsEncodedString() converts Unicode objects to a
> bytes object. This is not a symmetric replacement for the
> PyUnicode_Encode*() APIs, since those go from Py_UNICODE to
> a bytes object.

I don't see which feature is missing from PyUnicode_AsEncodedString().
If it's about parameters specific to some encodings like UTF-7, I
already replied in another email.


> Since the C API is not only meant to be used by the CPython interpreter,
> we should stick to standards rather than expecting the world to adapt
> to our implementations. This also makes the APIs future proof, e.g.
> in case we make another transition from the current hybrid internal
> data type for Unicode towards UTF-8 buffers as internal data type.

Do you know of any C extensions in the wild which use wchar_t* on
purpose? I haven't seen such a C extension yet.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/YW4FDLEHFXA6NKWZYVFOBNGPE33VQJ7U/