STINNER Victor <[email protected]> added the comment:
> Should it not simply return “i”?
Python implements the Unicode standard.
>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']
>>> import unicodedata
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the
character up in Python's internal Unicode database.
The U+0130 character takes the EXTENDED_CASE_MASK branch, which uses the
_PyUnicode_ExtendedCase secondary database for "extended case" mappings.
In the end, Python uses the following data file from the Unicode standard:
https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt
Extract:
"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""
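To make the link between that entry and the REPL output explicit, here is a
minimal sketch that parses the quoted line and compares it with str.lower().
The helper parse_special_casing() is hypothetical and written only for this
illustration; it is not how CPython reads the file, and it only handles
unconditional entries of the form "code; lower; title; upper; # comment".

```python
# The entry quoted above, copied verbatim from SpecialCasing.txt.
line = "0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE"


def to_str(code_points):
    """Turn a space-separated list of hex code points into a str."""
    return "".join(chr(int(cp, 16)) for cp in code_points.split())


def parse_special_casing(entry):
    """Hypothetical helper: split one unconditional SpecialCasing.txt entry.

    Returns (char, lower, title, upper) as Python strings.
    """
    data = entry.split("#", 1)[0]                 # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]        # ignore the empty tail field
    return to_str(code), to_str(lower), to_str(title), to_str(upper)


char, lower, title, upper = parse_special_casing(line)

# The full lowercase mapping is two code points: U+0069 followed by U+0307.
print(["U+%04X" % ord(ch) for ch in lower])
# str.lower() agrees with the data file.
print(char.lower() == lower)
```

The second field of the entry is the full lowercase mapping, which is why
"İ".lower() has length 2 rather than 1.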
If you want to convert strings differently for the special case of Turkish, you
need to use a different standard than Unicode...
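For illustration only, one possible workaround is to pre-map the two letters
that Turkish treats differently before calling str.lower(). The function
turkish_lower() below is a hypothetical helper, not a stdlib API, and it
assumes the usual Turkish convention (İ lowercases to i, I lowercases to
dotless ı). It does not handle input that already contains a decomposed
"I" followed by U+0307.

```python
# Hypothetical translation table: handle the two Turkish-specific letters
# before the generic, locale-independent str.lower() sees them.
_TURKISH_LOWER = str.maketrans({
    "\u0130": "i",        # LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
    "I": "\u0131",        # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
})


def turkish_lower(text):
    """Lowercase text using Turkish rules for the dotted and dotless I."""
    return text.translate(_TURKISH_LOWER).lower()


print(turkish_lower("İSTANBUL"))   # dotted capital maps to a plain "i"
print(turkish_lower("ISPARTA"))    # undotted capital maps to dotless "ı"
```

Every other character still goes through the standard Unicode mapping, so the
helper only changes the result for strings that contain "I" or "İ".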
I am closing the issue as NOT A BUG.
----------
resolution: -> not a bug
stage: -> resolved
status: open -> closed
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34723>
_______________________________________