On Sun, Jun 4, 2017 at 5:02 AM, Thomas Jollans <t...@tjol.eu> wrote:
> On 03/06/17 20:41, Chris Angelico wrote:
>> [snip]
>> For reference, as well as the 948 Sm, there are 1690 Mn and 5777 So,
>> but only these characters are valid from them:
>>
>> \u1885 Mn MONGOLIAN LETTER ALI GALI BALUDA
>> \u1886 Mn MONGOLIAN LETTER ALI GALI THREE BALUDA
>> ℘ Sm SCRIPT CAPITAL P
>> ℮ So ESTIMATED SYMBOL
>>
>> 2118 SCRIPT CAPITAL P and 212E ESTIMATED SYMBOL are listed in
>> PropList.txt as Other_ID_Start, so they make sense. But that doesn't
>> explain the two characters from category Mn. It also doesn't explain
>> why U+309B and U+309C are *not* valid, despite being declared
>> Other_ID_Start. Maybe it's a bug? Maybe 309B and 309C somehow got
>> switched into 1885 and 1886??
>
> \u1885 and \u1886 are categorised as letters (category Lo) by my Python
> 3.5. (Which makes sense, right?) If your system puts them in category
> Mn, that's bound to be a bug somewhere.

rosuav@sikorsky:~$ python3.7 -c "import unicodedata; print(unicodedata.unidata_version, unicodedata.category('\u1885'))"
9.0.0 Mn
rosuav@sikorsky:~$ python3.6 -c "import unicodedata; print(unicodedata.unidata_version, unicodedata.category('\u1885'))"
8.0.0 Lo
rosuav@sikorsky:~$ python3.5 -c "import unicodedata; print(unicodedata.unidata_version, unicodedata.category('\u1885'))"
8.0.0 Lo
rosuav@sikorsky:~$ python3.4 -c "import unicodedata; print(unicodedata.unidata_version, unicodedata.category('\u1885'))"
6.3.0 Lo

Is it possible that there's a discrepancy between the Unicode version
used by the unicodedata module and the one used by the parser?

ChrisA
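
One way to probe for such a discrepancy on a given interpreter is to put the unicodedata category next to what the compiler actually accepts. This is only a rough sketch: the helper names (parser_accepts, survey) are made up for illustration, and it assumes that compiling a one-character assignment is a fair stand-in for "valid as an identifier". The results will differ between Python versions, so no expected output is shown.

```python
import unicodedata

# Code points discussed in this thread.
CODE_POINTS = [0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C]


def parser_accepts(text):
    """Return True if the compiler accepts `text` as an assignment target.

    Hypothetical helper: it compiles "<text> = 1" and treats a
    SyntaxError as "not a valid identifier".
    """
    try:
        compile(text + " = 1", "<identifier-check>", "exec")
    except SyntaxError:
        return False
    return True


def survey(code_points=CODE_POINTS):
    """Collect (label, category, str.isidentifier(), parser verdict) rows."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((
            "U+%04X" % cp,
            unicodedata.category(ch),   # from the unicodedata module
            ch.isidentifier(),          # from the str type's own tables
            parser_accepts(ch),         # from the compiler itself
        ))
    return rows


if __name__ == "__main__":
    print("unidata_version:", unicodedata.unidata_version)
    for label, category, is_ident, accepted in survey():
        print(label, category, is_ident, accepted)
```

If the last two columns ever disagree with each other, or a row is accepted although its category is not a letter category and the character is not listed as Other_ID_Start, that would point at the kind of mismatch asked about above.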