[issue45869] Unicode and acii regular expressions do not agree on ascii space characters

Joran van Apeldoorn Tue, 23 Nov 2021 04:43:30 -0800


Joran van Apeldoorn <g...@blubmail.nl> added the comment:


Hi,

I was not suggesting that the documentation literally says they should be the 
same, but it might be unexpected for users if ASCII characters change properties 
depending on whether they are considered in a Unicode or pure ASCII setting. 

The documentation says about re.A: "Make \w, \W, \b, \B, \d, \D, \s and \S 
perform ASCII-only matching instead of full Unicode matching." The problem 
might be that there is no clear notion of "ASCII-only matching". I assumed this 
means matching ASCII characters only, i.e., the character classes are simply 
limited to codes below 128. 

About \s the documentation says:
"Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also 
many other characters, for example the non-breaking spaces mandated by 
typography rules in many languages). If the ASCII flag is used, only [ 
\t\n\r\f\v] is matched.". This heavily implies that there are non-ASCII 
characters in Unicode that might be considered spaces, but that the ASCII 
characters are [ \t\n\r\f\v], although again, not stated literally. 

There might be valid reasons to change the definition (even for ASCII 
characters) depending on re.A, but should \s then not follow the Unicode 
standard's definition of white space in the Unicode case (which, for ASCII 
characters, would coincide with the current re.A behavior)? There seem to be 
many different places where Python is opinionated about what a space is, but 
not much consistency behind them.

I am a bit worried about the undocumented nature of the precise definitions of 
the regex classes in general. How is a user supposed to know that the default 
behavior of \s, when no flag is passed, is to also match other ASCII characters 
than those mentioned for the ASCII case? In contrast, the \d class is directly 
defined as the Unicode category [Nd]. 
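
A small sketch of the discrepancy described above: it lists the ASCII 
characters that \s matches by default but not under re.A. The variable names 
are mine. On the CPython versions discussed in this issue I would expect the 
result to be the control characters 0x1c through 0x1f, but the output may 
depend on the interpreter version.

```python
import re

# All 128 ASCII code points as one-character strings.
ascii_chars = [chr(i) for i in range(128)]

# Characters matched by \s in the default (Unicode) mode and under re.A.
unicode_mode = {c for c in ascii_chars if re.fullmatch(r"\s", c)}
ascii_mode = {c for c in ascii_chars if re.fullmatch(r"\s", c, re.A)}

# ASCII characters that count as whitespace only when re.A is absent.
only_unicode_mode = sorted(unicode_mode - ascii_mode)
print([hex(ord(c)) for c in only_unicode_mode])
```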

This is likely too hard to change, since too many things depend on it, but the 
following definitions would make more sense to me, and hopefully to others:
- Character classes are defined as a set of Unicode properties/categories, 
following the same definitions as elsewhere in Python.
- If re.A is passed, they are this same set but limited to codes below 128. 
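
To make the two rules concrete, here is a minimal sketch for \s. The function 
is_space_proposed is hypothetical and not part of the re module, and I am 
assuming str.isspace() as the stand-in for "the same definitions as elsewhere 
in Python"; another base definition could be substituted.

```python
def is_space_proposed(ch: str, ascii_only: bool = False) -> bool:
    """Hypothetical \\s membership test under the proposal above."""
    # Rule 2: with the ASCII flag, restrict the same set to codes below 128.
    if ascii_only and ord(ch) >= 128:
        return False
    # Rule 1: one shared definition (assumed here to be str.isspace()).
    return ch.isspace()


# An ASCII character gets the same answer with and without the flag.
print(is_space_proposed("\x1c"), is_space_proposed("\x1c", ascii_only=True))
# A non-ASCII space (NO-BREAK SPACE) is only excluded by the flag.
print(is_space_proposed("\xa0"), is_space_proposed("\xa0", ascii_only=True))
```

Under this sketch the flag can only ever remove non-ASCII characters, so no 
ASCII character changes class depending on re.A.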

After some digging in the code I traced the current definitions as follows:
 - For Unicode patterns, Py_UNICODE_ISSPACE is called, which either does a 
lookup in the constant table _Py_ascii_whitespace for ASCII characters or calls 
_PyUnicode_IsWhitespace for non-ASCII characters. Both of these define a space 
as "Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the 
category 'Zs'", i.e., this is simply the Unicode string isspace() definition. 
 - For ASCII patterns, Py_ISSPACE is called, which does a lookup in 
_Py_ctype_table. It is unclear to me how this table was made.
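
The trace above can be checked from Python without reading the C source. This 
is a sketch under two assumptions of mine: that str.isspace() reflects 
Py_UNICODE_ISSPACE and that bytes.isspace() reflects Py_ISSPACE. It only 
scans code points below 0x10000 to keep the run short.

```python
import re

UNICODE_SPACE = re.compile(r"\s")
ASCII_SPACE = re.compile(r"\s", re.A)

# Code points below 0x10000 where default-mode \s and str.isspace() disagree.
unicode_mismatches = [
    cp for cp in range(0x10000)
    if bool(UNICODE_SPACE.fullmatch(chr(cp))) != chr(cp).isspace()
]

# ASCII code points where \s under re.A and bytes.isspace() disagree.
ascii_mismatches = [
    cp for cp in range(128)
    if bool(ASCII_SPACE.fullmatch(chr(cp))) != bytes([cp]).isspace()
]

print(unicode_mismatches, ascii_mismatches)
```

If both lists come out empty, that is consistent with sre reusing the str and 
bytes definitions rather than defining its own.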

So sre just follows the other Python definitions.
While searching around I found issue #18236, which also considers how the 
Python definition differs from the Unicode one.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45869>
_______________________________________
