Nicolas Pontoizeau wrote:
> I am handling a mixed languages text file encoded in UTF-8. There is
> mainly French, English and Asian languages. I need to detect every
> Asian character in order to enclose it by a special tag for latex.
> Does anybody know if there is a unicode "table of character"
> implementation in python? I mean, I give a character and python replies
> to me with the language in which the character occurs.

This is a bit unspecific, so likely, nothing that already exists will be completely correct for your needs. If you need to escape characters for latex, I would expect that there is a more precise specification of what you need to escape - I doubt the fact that a character is used primarily in Asia matters much to latex.

In any case, somebody pointed you to the Unicode code blocks. I think these are Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0750..077F; Arabic Supplement
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1780..17FF; Khmer
1800..18AF; Mongolian
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
2D00..2D2F; Georgian Supplement
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
F900..FAFF; CJK Compatibility Ideographs
FB50..FDFF; Arabic Presentation Forms-A
FE30..FE4F; CJK Compatibility Forms
FE70..FEFF; Arabic Presentation Forms-B
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement

Notice that some scripts are used both in Asia and elsewhere, e.g. Latin and Cyrillic.
Arabic probably doesn't belong in this list, either, being used both in Asia and elsewhere as the script of the official language.

Regards,
Martin

-- 
http://mail.python.org/mailman/listinfo/python-list
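
As a rough sketch of how the block ranges above could be turned into a lookup: the code below keeps a small subset of the table as sorted (start, end, name) tuples and uses bisect to find the block a character falls into. The function names asian_block and tag_asian, and the \asian{...} macro, are made up for illustration; the actual latex markup depends on the packages in use, and the table would need to be extended with whichever blocks matter for the text at hand.

```python
import bisect
import itertools

# A subset of the block ranges from the list above, as (start, end, name).
# Must stay sorted by start for the bisect lookup to work.
ASIAN_BLOCKS = [
    (0x0900, 0x097F, "Devanagari"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x1100, 0x11FF, "Hangul Jamo"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]
_STARTS = [start for start, _end, _name in ASIAN_BLOCKS]


def asian_block(ch):
    """Return the block name for character ch, or None if not in the table."""
    cp = ord(ch)
    # Index of the last block whose start is <= cp (or -1 if there is none).
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _start, end, name = ASIAN_BLOCKS[i]
        if cp <= end:
            return name
    return None


def tag_asian(text, macro="asian"):
    """Wrap each maximal run of characters found in the table in \\macro{...}.

    The macro name is a placeholder, not a real latex command.
    """
    out = []
    groups = itertools.groupby(text, key=lambda c: asian_block(c) is not None)
    for is_asian, run in groups:
        chunk = "".join(run)
        if is_asian:
            out.append("\\%s{%s}" % (macro, chunk))
        else:
            out.append(chunk)
    return "".join(out)
```

Calling tag_asian on a decoded string wraps each run of characters from the table in one tag and leaves the French and English text untouched. Note that this classifies by code block, not by language: a CJK ideograph may be Chinese or Japanese, and the block alone cannot tell which.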
