[issue42157] Cleanup the unicodedata module

STINNER Victor Mon, 26 Oct 2020 10:05:43 -0700


New submission from STINNER Victor <[email protected]>:


Mohamed Koubaa and I are trying to convert the unicodedata module to the
multi-phase initialization API (PEP 489) and to convert the UCD static type to
a heap type in bpo-1635741.

The unicodedata extension module has some special cases:

* It has a C API exposed in Python as the "unicodedata.ucnhash_CAPI" PyCapsule
object.
* In C, the unicodedata_functions array is used to define module functions
*AND* unicodedata.UCD methods. It is unusual to do that and it makes the
conversion more tricky.
* Most C functions have a "self" parameter which is used to choose between the 
current version of the Unicode database and the version 3.2.0 
("unicodedata.ucd_3_2_0").

There is also a unicodedata.UCD type which cannot be instantiated in Python. It
is only used to create the unicodedata.ucd_3_2_0 instance.
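
The special cases above can be observed from Python. This is only a small
sketch: the variable names are mine, and U+1F600 is picked merely as an example
of a character that was added to Unicode after version 3.2.

```python
import unicodedata

# ucd_3_2_0 is the UCD instance that the module exposes.
old_ucd = unicodedata.ucd_3_2_0
print(type(old_ucd) is unicodedata.UCD)
print(unicodedata.unidata_version, old_ucd.unidata_version)

# The same function name exists as a module function and as a UCD method.
# Which database is consulted depends on the "self" the C code receives.
ch = "\U0001F600"  # example character, not assigned in Unicode 3.2
current_category = unicodedata.category(ch)
old_category = old_ucd.category(ch)
print(current_category, old_category)

# The UCD type itself cannot be instantiated from Python.
try:
    unicodedata.UCD()
    instantiation_failed = False
except TypeError:
    instantiation_failed = True
print(instantiation_failed)
```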

In the commit 47e1afd2a1793b5818a16c41307a4ce976331649, I moved the private 
_PyUnicode_Name_CAPI structure to internal C API.

In the commit ddc0dd001a4224274ba6f83568b45a1dd88c6fc6, Mohamed added a
ucd_type parameter to the UCD_Check() macro. I asked him to do that.

In the commit e6b8c5263a7fcf5b95d0fd4c900e5949eeb6630d, I added a "global
module state" and a "state" parameter to most functions. This change prepares
the code base to pass a UCD type instance to functions, so that there can be
more than one UCD type once it is converted to a heap type: one type per
module instance.

The technical problem is that unicodedata_functions is used for both module
functions and UCD methods. Duplicating unicodedata_functions would require
duplicating a lot of code and comments.

Sadly, it does not seem easy to retrieve the "module state" ("state"
variable) in functions, since unicodedata_functions is reused for module
functions and UCD methods. Using "defining_class" in Argument Clinic would
require duplicating all unicodedata_functions functions: one flavor for module
functions, one flavor for the UCD type. It would also require duplicating all
docstrings, which would increase the maintenance burden and introduce a risk
of inconsistencies.

Maybe we could introduce a new UCD instance which would be mapped to the
current Unicode Character Database version, and module functions would be
bound methods of this instance. But it sounds like overkill to me.

By the way, Unicode 3.2 was released in 2002: 18 years ago. I don't think that
it's still relevant in 2020 to keep backward compatibility with Unicode 3.2. I
propose to deprecate unicodedata.ucd_3_2_0 and deprecate the unicodedata.UCD
type. In Python 3.12, we will be able to remove a lot of code, and simplify the
remaining code a lot.

For now, we can convert unicodedata to the multi-phase initialization API (PEP
489) and convert the UCD static type to a heap type by avoiding references to
the UCD type. Rather than checking if self is an instance of UCD_Type, we can
check if it is not a module (PyModule_Check). This is exactly what Mohamed
proposed in the first place, but I misunderstood the whole issue and gave him
bad advice.
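
To illustrate the idea from the Python side: the "self" that a shared C
function receives is visible as the __self__ attribute of the built-in
function or bound method. The helper below is a hypothetical Python analogue
of a "not PyModule_Check(self)" test, not the actual C code.

```python
import types
import unicodedata

# For a module function, "self" is the module object itself.
module_self = unicodedata.category.__self__
# For a UCD method, "self" is the UCD instance.
method_self = unicodedata.ucd_3_2_0.category.__self__


def uses_old_database(self_obj):
    """Hypothetical analogue of '!PyModule_Check(self)' in C."""
    return not isinstance(self_obj, types.ModuleType)


print(uses_old_database(module_self))
print(uses_old_database(method_self))
```

With this check, the code never needs a reference to the UCD type to decide
which database version to use.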

----------
components: Library (Lib)
messages: 379673
nosy: vstinner
priority: normal
severity: normal
status: open
title: Cleanup the unicodedata module
versions: Python 3.10

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue42157>
_______________________________________