[issue40845] idna encoding fails for Cherokee symbols

Roman Akopov Tue, 02 Jun 2020 13:38:56 -0700

Roman Akopov <[email protected]> added the comment:

This is how I extract the data from the Common Locale Data Repository (CLDR) v37.
The script assumes common\main is the working directory.


from os import walk
from xml.etree import ElementTree

en_root = ElementTree.parse('en.xml')

for (dirpath, dirnames, filenames) in walk('.'):
    for filename in filenames:
        if filename.endswith('.xml'):
            code = filename[:-4]
            xx_root = ElementTree.parse(filename)
            xx_lang = xx_root.find('localeDisplayNames/languages/language[@type=\'' + code + '\']')
            en_lang = en_root.find('localeDisplayNames/languages/language[@type=\'' + code + '\']')

            # find() returns None when en.xml has no entry for this locale code
            if en_lang is not None and en_lang.text == 'Cherokee':
                print(en_lang.text)
                print(xx_lang.text)
                print(xx_lang.text.encode("unicode_escape"))
                print(xx_lang.text.encode('idna'))
                print(ord(xx_lang.text[0]))
                print(ord(xx_lang.text[1]))
                print(ord(xx_lang.text[2]))

The script outputs:

Cherokee
ᏣᎳᎩ
b'\\u13e3\\u13b3\\u13a9'
b'xn--tz9ata7l'
5091
5043
5033
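
The lookup step can be tried without a CLDR checkout. The following is only a minimal sketch: SAMPLE_LDML is a hypothetical, heavily trimmed stand-in for chr.xml, and language_name is a helper name invented here, not part of the script above. It also shows that find() returns None when a code has no entry.

```python
from xml.etree import ElementTree

# Hypothetical, trimmed stand-in for a CLDR locale file such as chr.xml.
SAMPLE_LDML = """
<ldml>
  <localeDisplayNames>
    <languages>
      <language type="chr">\u13e3\u13b3\u13a9</language>
    </languages>
  </localeDisplayNames>
</ldml>
"""


def language_name(root, code):
    """Return the display name stored for `code`, or None if there is no entry."""
    node = root.find(f"localeDisplayNames/languages/language[@type='{code}']")
    return None if node is None else node.text


root = ElementTree.fromstring(SAMPLE_LDML.strip())

name = language_name(root, "chr")
print(name)                         # the three Cherokee letters
print([ord(ch) for ch in name])     # [5091, 5043, 5033]
print(language_name(root, "zz"))    # None: no such entry in the sample
```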

If I change the text to lower case:

                print(en_lang.text.lower())
                print(xx_lang.text.lower())
                print(xx_lang.text.lower().encode("unicode_escape"))
                print(xx_lang.text.lower().encode('idna'))
                print(ord(xx_lang.text.lower()[0]))
                print(ord(xx_lang.text.lower()[1]))
                print(ord(xx_lang.text.lower()[2]))

then the script outputs:

cherokee
ꮳꮃꭹ
b'\\uabb3\\uab83\\uab79'
b'xn--tz9ata7l'
43955
43907
43897

I am not sure where you got the '\u13e3\u13b3\u13a9' string from.
'\u13e3\u13b3\u13a9'.lower().encode('unicode_escape') gives
b'\\uabb3\\uab83\\uab79'
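
The case mapping can be checked in isolation. This is a rough sketch, assuming an interpreter whose Unicode database includes the Cherokee small letters (Unicode 8.0 or later, which should mean Python 3.5+). The variable names are illustrative only.

```python
import unicodedata

# The string as stored in CLDR v37, according to the first output above.
upper = "\u13e3\u13b3\u13a9"
lower = upper.lower()

# Show each code point next to the one it lowercases to.
for u, l in zip(upper, lower):
    print(f"U+{ord(u):04X} {unicodedata.name(u)} -> "
          f"U+{ord(l):04X} {unicodedata.name(l)}")

print(lower.encode("unicode_escape"))   # b'\\uabb3\\uab83\\uab79'
print(lower.upper() == upper)           # True: upper() reverses the mapping
```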

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue40845>
_______________________________________