[issue26917] Inconsistency in unicodedata.normalize()?

Armin Rigo Tue, 03 May 2016 01:49:11 -0700

New submission from Armin Rigo:

There is an apparent inconsistency in unicodedata.normalize("NFC", ...), 
introduced with the switch from the Unicode DB 5.1.0 to 5.2.0 (in Python 2.7).  
First, please note that my knowledge of Unicode is limited, so I may be wrong 
and the following behavior might be perfectly correct.


>>> from unicodedata import normalize
>>> print(normalize("NFC", "---\uafb8\u11a7---").encode('utf-8'))
b'---\xea\xbe\xb8\xe1\x86\xa7---'    # i.e., the same as the input

>>> print(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1").encode('utf-8'))
b'---\xea\xbe\xb8---\xe3\xa4\xba'

Note how in the second example the initial two-character part is replaced with 
a single character (actually the first of them).  This does not occur in the 
first example.  In Python 2.6, both inputs would be normalized to the 
single-character output.

The new behavior introduced in Python 2.7 is to first do a quick-check on the 
string, and if this `is_normalized()` function returns 1, we know that the 
string should already be normalized and we return it unmodified.  However, the 
example "\uafb8\u11a7" shows contradictory behavior: is_normalized() returns 1 
for it, yet actual normalization changes it.  We can see 
in the second example above that if, for an unrelated reason, we force 
is_normalized() to return 0 (by adding some non-normalized character elsewhere 
in the string), then the "\uafb8\u11a7" is changed.

This is a bit unexpected, but I don't know if it is officially correct behavior 
or if the problem is a bug in `is_normalized()`.
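
A minimal sketch of how the two code paths could be compared for an arbitrary 
string, using the same trick as the second example: appending "\U0002f8a1" 
(which is not NFC-normalized) so that the quick check fails and the full 
normalization runs.  The helper names `normalize_slow_path()` and 
`is_consistent()` are hypothetical and not part of unicodedata; the result for 
the sample string depends on the Python build, so none is stated here.

```python
from unicodedata import normalize

# U+2F8A1 normalizes to U+393A, so a string containing it is never
# "already normalized"; appending it forces the full normalization pass.
FORCE_SLOW = "\U0002f8a1"


def normalize_slow_path(text, form="NFC"):
    """Normalize text with a non-normalized suffix attached, then drop the suffix.

    Hypothetical helper: assumes the suffix does not interact with the
    end of `text` (U+393A is a plain CJK ideograph, so it should not).
    """
    suffix = normalize(form, FORCE_SLOW)
    result = normalize(form, text + FORCE_SLOW)
    return result[:-len(suffix)]


def is_consistent(text, form="NFC"):
    """Return True if the direct call and the forced full pass agree."""
    return normalize(form, text) == normalize_slow_path(text, form)


sample = "---\uafb8\u11a7---"
print(ascii(normalize("NFC", sample)))      # direct call (may take the quick-check shortcut)
print(ascii(normalize_slow_path(sample)))   # forced full normalization
print(is_consistent(sample))
```

If `is_consistent()` prints False for the sample, the build shows the 
discrepancy described above.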

----------
components: Unicode
messages: 264697
nosy: arigo, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Inconsistency in unicodedata.normalize()?
type: behavior
versions: Python 2.7, Python 3.6

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26917>
_______________________________________