Fredrik Lundh wrote:
> John Machin wrote:
>
> > 3. ... and to check for missing maps. The OP may be working only with
> > French text, and may not care about Icelandic and German letters, but
> > other readers who stumble on this (and miss past thread(s) on this
> > topic) may like something done with \xde (capital thorn), \xfe (small
> > thorn) and \xdf (sharp s aka Eszett).
>
> I did post links to code that does this to this thread, several days ago...
>
Ah yes, I missed that -- and your posting doesn't advertise that the
code fixed the "one character should be mapped to two" cases :-)

This code
(http://effbot.python-hosting.com/file/stuff/sandbox/text/unaccent.py)
looks generally very good, but I'm left wondering why the table has
"AE" and "OE" rather than "Ae" and "Oe" (compare "Th" for capital thorn):

[snip]
    0xc6: u"AE", # LATIN CAPITAL LETTER AE <<<=== ??
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE <<<=== ??
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
[snip]

Another point: there are many non-latin1 characters that could be
mapped to ASCII. For example:

    u"\u0141ukasziewicz".translate(unaccented_map())

doesn't work unless an entry is added to the no-decomposition table:

    0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE

It looks like extra entries like that could be generated with the aid
of unicodedata.name():

    LATIN CAPITAL LETTER X WITH blahblah -> "X"
    LATIN SMALL LETTER X WITH blahblah -> "X".lower()

This would require a fair bit of care -- obviously there are special
cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
experts is probably required.

Cheers,
John
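
P.S. A rough sketch of the unicodedata.name() idea, in case it helps. This is not code from unaccent.py: the names guess_extra_entries, SKIP and limit are made up for illustration, the skip list is certainly incomplete, and it is written for Python 3 (chr and str.translate rather than unichr and unicode.translate). It only proposes candidates for characters that have no decomposition, so the output would still need eyeballing before going into the table.

```python
import re
import unicodedata

# Matches names such as "LATIN CAPITAL LETTER L WITH STROKE".
# Only a single base letter followed by " WITH " qualifies, so
# "LATIN CAPITAL LETTER AE" and "... LETTER THORN" are not matched.
_NAME_RE = re.compile(r"^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH ")

# Hypothetical list of special cases to leave to the hand-written table,
# e.g. O WITH STROKE (capital and small).
SKIP = {0xd8, 0xf8}


def guess_extra_entries(limit=0x250):
    """Return {code point: ASCII letter} candidates derived from names.

    Only characters without a Unicode decomposition are considered,
    since the others can be handled by decomposing them.
    """
    extra = {}
    for cp in range(0x80, limit):
        if cp in SKIP:
            continue
        ch = chr(cp)
        if unicodedata.decomposition(ch):
            continue  # has a decomposition; no extra entry needed
        match = _NAME_RE.match(unicodedata.name(ch, ""))
        if match:
            case, letter = match.groups()
            extra[cp] = letter if case == "CAPITAL" else letter.lower()
    return extra


if __name__ == "__main__":
    candidates = guess_extra_entries()
    for cp in sorted(candidates):
        print("0x%04x: %r, # %s" % (cp, candidates[cp], unicodedata.name(chr(cp))))
```

Running it prints candidate lines in the same shape as the table entries above, which makes it easy to paste the ones that survive review.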