iMath writes:

> how to detect the encoding used for a specific text data ?
The practical thing to do is to try an encoding and see whether the decoded text contains the letters, or the words, that you expect to be frequent in the relevant languages. This is likely to help you decide between some of the most common encodings. Some decoding attempts may even raise an exception, which should be a clue.

Strictly speaking, it cannot be done with complete certainty. There are lots of Finnish texts that are identical whether you take them to be in Latin-1 or Latin-9; a further text from the same source might still reveal the difference, so the distinction matters. Short Finnish texts might also be identical whether you take them to be in Latin-1 or UTF-8, but that situation is different: in the wrong encoding a couple of frequent letters turn into nonsense, and it's easy to tell at a glance.

Sometimes texts declare their encoding. That should be a clue, but in practice the declaration may be false. Sometimes there is a stray character that violates the declared or assumed encoding, or one part of the text is in one encoding and another part in another. Bad source. You decide how important it is to deal with the mess. (This only happens in the real world.)

Good luck.
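To make the try-and-score idea concrete, here is a rough Python sketch. The candidate encodings, the letters taken to be frequent, and the sample phrase are only assumptions for illustration; substitute whatever suits your data and languages.

# Rough sketch: try each candidate encoding and rank the ones that
# decode cleanly by how plausible the result looks.

def guess_encoding(data, candidates=('utf-8', 'iso-8859-1', 'iso-8859-15'),
                   frequent='aäeinst'):
    """Rank candidate encodings for the byte string `data`."""
    results = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # decoding failed outright: a strong clue against enc
        # Fraction of the text made up of letters we expect to be frequent.
        # ISO-8859-1 decodes any byte sequence without error, so for it
        # the score is the only clue.
        hits = sum(text.lower().count(c) for c in frequent)
        results.append((enc, hits / max(len(text), 1)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# UTF-8 bytes of a Finnish phrase, deliberately fed in as raw bytes.
sample = 'Hyvää päivää'.encode('utf-8')
for encoding, score in guess_encoding(sample):
    print(encoding, round(score, 2))

With this sample, UTF-8 comes out on top: read as Latin-1 or Latin-9 the frequent 'ä's turn into 'Ã¤' and 'Ã€', so those guesses score poorly, just as described above.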