New submission from James Gerity <[email protected]>:
The documentation for the `re` library¹ describes the behavior of the specifier
'\w' as matching "Unicode word characters," which is very vague. The closest
thing I can find that corresponds to this language is the guidance offered in
Unicode Technical Standard #18², which defines the class `<word_character>` to
include all alphabetic and decimal codepoints, as well as U+200C ZERO WIDTH
NON-JOINER and U+200D ZERO WIDTH JOINER. This does not appear to be a correct
description of `re`, however, as these zero-width characters are not counted
when matching '\w', e.g.:
```
>>> import re
>>> re.match(r'\w*', 'Auf\u200Clage')
<re.Match object; span=(0, 3), match='Auf'>
```
It seems from examining the CPython source³ that SRE treats '\w' as meaning any
alphanumeric character OR U+005F LOW LINE (the underscore), which does not match any
Unicode class definition I've been able to find.
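To make that reading concrete, here is a minimal sketch that compares `re` against a pure-Python approximation. The helper `is_word_char` is a hypothetical name of my own, and its use of `str.isalnum()` is an assumption based on my reading of the C source, not a documented guarantee; the sample characters are arbitrary.

```python
import re

# Hypothetical helper: my reading of the SRE source is that a "word"
# character is anything alphanumeric, or the underscore.
def is_word_char(ch: str) -> bool:
    return ch.isalnum() or ch == "_"

# A few illustrative code points, including the two zero-width joiners
# that UTS #18 places in its word-character class.
samples = ["A", "5", "_", "\u00e9", "\u200c", "\u200d", "-"]

for ch in samples:
    matched = re.fullmatch(r"\w", ch) is not None
    print(f"U+{ord(ch):04X}  re={matched}  approx={is_word_char(ch)}")
```

For these samples the two columns agree, and both report the zero-width characters as non-word characters, which is the discrepancy with UTS #18 described above.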
Can anyone provide clarification on what part of Unicode this documentation is
referring to? If there is some other definition, the documentation should be
more specific about referring to it (and including a link would be preferred).
If instead the documentation is incorrect, this language should be changed to
describe the true meaning of '\w'.
¹ https://docs.python.org/3/library/re.html#index-32
² http://unicode.org/reports/tr18/
³ https://github.com/python/cpython/blob/master/Modules/_sre.c#L125
----------
assignee: docs@python
components: Documentation
messages: 355239
nosy: docs@python, snoopjedi
priority: normal
severity: normal
status: open
title: Description of '\w' behavior is vague in `re` documentation
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue38566>
_______________________________________
_______________________________________________
Python-bugs-list mailing list