[Python-Dev] Re: PEP 624: Remove Py_UNICODE encoder APIs

M.-A. Lemburg Wed, 08 Jul 2020 13:48:23 -0700

Hi Inada-san,

I am currently too busy with EuroPython to participate in longer
discussions. FWIW: I intend to continue after EuroPython.


In any case, thanks for writing up the PEP. Could you please add my
points about:

- the fact that the encode APIs encode from a Unicode buffer
  to a bytes object; this is an important fact, since the removal
  removes access to this codec functionality for extensions

- PyUnicode_AsEncodedString() is not a proper alternative, since
  it requires creating a temporary PyUnicode object, which is
  inefficient and wastes memory

- the maintenance effect mentioned in the PEP does not really
  materialize, since the underlying functionality still exists
  in the codecs - only access to the functionality is removed

- keeping just the generic PyUnicode_Encode() API would be a
  compromise

- if we remove the codec specific PyUnicode_Encode*() APIs, why
  are we still keeping the special PyUnicode_Decode*() APIs ?

- the deprecations were just done because the Py_UNICODE data
  type was replaced by a hybrid type. Using this as an argument
  for removing functionality is not really good practice, when
  there are ways to continue exposing the functionality using other
  data types.

I am still strongly -1 on removing all encoding APIs without
at least some upgrade path for existing code to use and keeping
the API symmetric.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com



On 07.07.2020 17:17, Inada Naoki wrote:
> Hi, folks.
> 
> Since the previous discussion was suspended without consensus, I wrote
> a new PEP for it. (Thank you Victor for reviewing it!)
> 
> This PEP looks very similar to PEP 623 "Remove wstr from Unicode",
> but for encoder APIs, not for Unicode object APIs.
> 
> URL (not available yet): https://www.python.org/dev/peps/pep-0624/
> 
> ---
> 
> PEP: 624
> Title: Remove Py_UNICODE encoder APIs
> Author: Inada Naoki <songofaca...@gmail.com>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 06-Jul-2020
> Python-Version: 3.11
> 
> 
> Abstract
> ========
> 
> This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in
> Python 3.11:
> 
> * ``PyUnicode_Encode()``
> * ``PyUnicode_EncodeASCII()``
> * ``PyUnicode_EncodeLatin1()``
> * ``PyUnicode_EncodeUTF7()``
> * ``PyUnicode_EncodeUTF8()``
> * ``PyUnicode_EncodeUTF16()``
> * ``PyUnicode_EncodeUTF32()``
> * ``PyUnicode_EncodeUnicodeEscape()``
> * ``PyUnicode_EncodeRawUnicodeEscape()``
> * ``PyUnicode_EncodeCharmap()``
> * ``PyUnicode_TranslateCharmap()``
> * ``PyUnicode_EncodeDecimal()``
> * ``PyUnicode_TransformDecimalToASCII()``
> 
> .. note::
> 
>    `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove
>    Unicode object APIs relating to ``Py_UNICODE``. This PEP, on the other
>    hand, does not concern the Unicode object itself. The two PEPs are split
>    because they have different motivations and need separate discussions.
> 
> 
> Motivation
> ==========
> 
> In general, removing APIs that have been deprecated for a long time
> and have few users is a good idea: it not only improves the
> maintainability of CPython, but also helps API users and other
> Python implementations.
> 
> 
> Rationale
> =========
> 
> Deprecated since Python 3.3
> ---------------------------
> 
> ``Py_UNICODE`` and APIs using it have been deprecated since Python 3.3.
> 
> 
> Inefficient
> -----------
> 
> All of these APIs are implemented using ``PyUnicode_FromWideChar``.
> So these APIs are inefficient when the user wants to encode a Unicode
> object.
> 
> 
> Not used widely
> ---------------
> 
> A search of the top 4000 PyPI packages [1]_ found that only pyodbc
> uses these APIs:
> 
> * ``PyUnicode_EncodeUTF8()``
> * ``PyUnicode_EncodeUTF16()``
> 
> pyodbc uses these APIs to encode a Unicode object into a bytes object,
> so it is easy to fix. [2]_
> 
> 
> Alternative APIs
> ================
> 
> There are alternative APIs that accept ``PyObject *unicode`` instead of
> ``Py_UNICODE *``. Users can migrate to them.
> 
> 
> ========================================= ==========================================
> Deprecated API                            Alternative APIs
> ========================================= ==========================================
> ``PyUnicode_Encode()``                    ``PyUnicode_AsEncodedString()``
> ``PyUnicode_EncodeASCII()``               ``PyUnicode_AsASCIIString()`` \(1)
> ``PyUnicode_EncodeLatin1()``              ``PyUnicode_AsLatin1String()`` \(1)
> ``PyUnicode_EncodeUTF7()``                \(2)
> ``PyUnicode_EncodeUTF8()``                ``PyUnicode_AsUTF8String()`` \(1)
> ``PyUnicode_EncodeUTF16()``               ``PyUnicode_AsUTF16String()`` \(3)
> ``PyUnicode_EncodeUTF32()``               ``PyUnicode_AsUTF32String()`` \(3)
> ``PyUnicode_EncodeUnicodeEscape()``       ``PyUnicode_AsUnicodeEscapeString()``
> ``PyUnicode_EncodeRawUnicodeEscape()``    ``PyUnicode_AsRawUnicodeEscapeString()``
> ``PyUnicode_EncodeCharmap()``             ``PyUnicode_AsCharmapString()`` \(1)
> ``PyUnicode_TranslateCharmap()``          ``PyUnicode_Translate()``
> ``PyUnicode_EncodeDecimal()``             \(4)
> ``PyUnicode_TransformDecimalToASCII()``   \(4)
> ========================================= ==========================================
> 
> Notes:
> 
> (1)
>    ``const char *errors`` parameter is missing.
> 
> (2)
>    There is no public alternative API, but users can use the generic
>    ``PyUnicode_AsEncodedString()`` instead.
> 
> (3)
>    ``const char *errors, int byteorder`` parameters are missing.
> 
> (4)
>    There is no direct replacement, but ``Py_UNICODE_TODECIMAL``
>    can be used instead. CPython itself uses
>    ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting
>    from Unicode to numbers.
> 
> 
> Plan
> ====
> 
> Python 3.9
> ----------
> 
> Add ``Py_DEPRECATED(3.3)`` to the following APIs. This change has
> already been committed [3]_. All other APIs have already been marked
> ``Py_DEPRECATED(3.3)``.
> 
> * ``PyUnicode_EncodeDecimal()``
> * ``PyUnicode_TransformDecimalToASCII()``
> 
> Document all APIs as "will be removed in version 3.11".
> 
> 
> Python 3.11
> -----------
> 
> The following APIs are removed:
> 
> * ``PyUnicode_Encode()``
> * ``PyUnicode_EncodeASCII()``
> * ``PyUnicode_EncodeLatin1()``
> * ``PyUnicode_EncodeUTF7()``
> * ``PyUnicode_EncodeUTF8()``
> * ``PyUnicode_EncodeUTF16()``
> * ``PyUnicode_EncodeUTF32()``
> * ``PyUnicode_EncodeUnicodeEscape()``
> * ``PyUnicode_EncodeRawUnicodeEscape()``
> * ``PyUnicode_EncodeCharmap()``
> * ``PyUnicode_TranslateCharmap()``
> * ``PyUnicode_EncodeDecimal()``
> * ``PyUnicode_TransformDecimalToASCII()``
> 
> 
> Alternative ideas
> =================
> 
> Instead of just removing deprecated APIs, we may be able to reuse their
> names with different signatures.
> 
> 
> Make some private APIs public
> ------------------------------
> 
> ``PyUnicode_EncodeUTF7()`` has no public alternative API.
> 
> Some APIs have public alternatives, but those are missing the
> ``const char *errors`` or ``int byteorder`` parameters.
> 
> We can rename some private APIs and make them public to cover the
> missing APIs and parameters.
> 
> ============================= ================================
>  Rename to                     Rename from
> ============================= ================================
> ``PyUnicode_EncodeASCII()``    ``_PyUnicode_AsASCIIString()``
> ``PyUnicode_EncodeLatin1()``   ``_PyUnicode_AsLatin1String()``
> ``PyUnicode_EncodeUTF7()``     ``_PyUnicode_EncodeUTF7()``
> ``PyUnicode_EncodeUTF8()``     ``_PyUnicode_AsUTF8String()``
> ``PyUnicode_EncodeUTF16()``    ``_PyUnicode_EncodeUTF16()``
> ``PyUnicode_EncodeUTF32()``    ``_PyUnicode_EncodeUTF32()``
> ============================= ================================
> 
> Pros:
> 
> * We have a more consistent API set.
> 
> Cons:
> 
> * We have more public APIs to maintain.
> * Existing public APIs are enough for most use cases, and
>   ``PyUnicode_AsEncodedString()`` can be used in other cases.
> 
> 
> Replace ``Py_UNICODE*`` with ``Py_UCS4*``
> -----------------------------------------
> 
> We can replace ``Py_UNICODE`` (typedef of ``wchar_t``) with
> ``Py_UCS4``. Since the built-in codecs support UCS-4, we don't need to
> convert a ``Py_UCS4*`` string to a Unicode object.
> 
> 
> Pros:
> 
> * We have a more consistent API set.
> * Users can encode a UCS-4 string in C without creating a Unicode object.
> 
> Cons:
> 
> * We have more public APIs to maintain.
> * Applications which use UTF-8 or UTF-16 cannot use these APIs
>   anyway.
> * Other Python implementations may not have a built-in codec for UCS-4.
> * If we change the Unicode internal representation to UTF-8, we need
>   to keep UCS-4 support only for these APIs.
> 
> 
> Replace ``Py_UNICODE*`` with ``wchar_t*``
> -----------------------------------------
> 
> We can replace ``Py_UNICODE`` with ``wchar_t``.
> 
> Pros:
> 
> * We have a more consistent API set.
> * Backward compatible.
> 
> Cons:
> 
> * We have more public APIs to maintain.
> * They are inefficient on platforms where ``wchar_t*`` is UTF-16,
>   because the built-in codecs support only UCS-1, UCS-2, and UCS-4
>   input.
> 
> 
> Rejected ideas
> ==============
> 
> Using runtime warning
> ---------------------
> 
> These APIs do not release the GIL for now. Emitting a warning from
> such APIs is not safe. See this example.
> 
> .. code-block::
> 
>    PyObject *u = PyList_GET_ITEM(list, i);  // u is borrowed reference.
>    PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
>            PyUnicode_GET_SIZE(u), NULL);
>    // Assumes u is still living reference.
>    PyObject *t = PyTuple_Pack(2, u, b);
>    Py_DECREF(b);
>    return t;
> 
> If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning
> filters and other threads may change the ``list``, and ``u`` can be
> a dangling reference after ``PyUnicode_EncodeUTF8()`` returns.
> 
> Additionally, since we are not changing behavior but removing C APIs,
> a runtime ``DeprecationWarning`` might not be helpful for Python
> developers. We should warn extension developers instead.
> 
> 
> Discussions
> ===========
> 
> * `Plan to remove Py_UNICODE APis except PEP 623
>   <https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_
> * `bpo-41123: Remove Py_UNICODE APIs except PEP 623
>   <https://bugs.python.org/issue41123>`_
> 
> 
> References
> ==========
> 
> .. [1] Source package list chosen from top 4000 PyPI packages.
>    (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)
> 
> .. [2] pyodbc -- Don't use PyUnicode_Encode API #792
>    (https://github.com/mkleehammer/pyodbc/pull/792)
> 
> .. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318)
>    (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181)
> 
> 
> Copyright
> =========
> 
> This document has been placed in the public domain.
> 
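
On the wchar_t* alternative in the quoted draft: where wchar_t is 16 bits
wide (UTF-16), a buffer has to be widened before a codec that only takes
UCS-1, UCS-2, or UCS-4 input can consume it. Below is a minimal sketch of
that extra pass in plain C, with no CPython dependency. The function name
utf16_to_ucs4() is invented for illustration, and copying unpaired
surrogates through unchanged is an assumption of this sketch, not a
statement about what CPython does internally:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand UTF-16 code units into UCS-4 code points.
   `out` must have room for at least `n` elements.
   Returns the number of code points written. */
static size_t
utf16_to_ucs4(const uint16_t *in, size_t n, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = in[i];
        /* A high surrogate followed by a low surrogate forms one
           code point above U+FFFF. */
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;  /* consume the low surrogate as well */
        }
        /* Assumption: unpaired surrogates are copied through as-is. */
        out[j++] = c;
    }
    return j;
}
```

The extra buffer and the additional pass over the data are the cost the
draft describes as inefficient on such platforms.
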
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/QT7QVAKF36Y2GOXNPXZ5AGKWGKZI3XT7/
Code of Conduct: http://python.org/psf/codeofconduct/
