On Wed, May 23, 2018 at 7:23 AM, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
>> The best you can do is to go ask the canonical source of the
>> file what encoding the file is _supposed_ to be in.
>
> I disagree on both counts.
>
> 1) For any given file it is almost always possible to find the correct
>    encoding (or *a* correct encoding, as there may be more than one).

You can find an encoding which is capable of decoding a file. That's
not the same thing as finding the correct one.
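
To make that concrete, here is a minimal sketch. The sample word and
the list of candidate codecs are my own picks for illustration, not
anything from a real file. Every codec decodes the bytes without
error, but they do not agree on the text:

```python
# Hypothetical sample: the Finnish word "pään" stored as Latin-1 bytes.
raw = "pään".encode("latin-1")

# Arbitrary single-byte codecs chosen for the demonstration.
candidates = ["latin-1", "cp1252", "iso8859_7", "cp437"]

# None of these raises UnicodeDecodeError, so "it decoded" proves little.
decoded = {name: raw.decode(name) for name in candidates}

for name, text in decoded.items():
    print(f"{name:10} {text!r}")
```

A successful decode only tells you the bytes are legal in that
encoding, not that the result is what the author wrote.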

>    This may require domain-specific knowledge (e.g. it may be necessary
>    to recognize the human language and know at least some distinctive
>    words, or to know some special symbols likely to be used in a data
>    file), and it almost always takes a bit of detective work and trial
>    and error. But I don't think I ever encountered a file where I
>    couldn't figure out the encoding.

Look up the old classic "bush hid the facts" hack with Windows
Notepad. A pure ASCII file that got misdetected based on the byte
patterns in it.
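
A rough illustration of why that guess was even possible (this shows
only the byte-level coincidence, not how Notepad's detection actually
worked): the 18 ASCII bytes are also a perfectly valid run of nine
UTF-16 code units.

```python
raw = b"Bush hid the facts"  # 18 bytes of pure ASCII

as_ascii = raw.decode("ascii")
# The same bytes, read as little-endian UTF-16: nine two-byte code units.
as_utf16 = raw.decode("utf-16-le")

print(as_ascii)
print(as_utf16)
print([f"U+{ord(ch):04X}" for ch in as_utf16])
```

Both decodes succeed, so nothing in the bytes themselves says which
reading was intended.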

If you restrict yourself to ASCII-compatible eight-bit encodings, you
MAY be able to figure out what something is. (I have that exact
situation when parsing subtitle files.) Bizarre constructs like
"Tuuleen jδiseen mδ nostan pδδn" are a strong indication that the
encoding is wrong - if most of a word is ASCII, it's likely that the
non-ASCII bytes represent accented characters, not characters from a
completely different alphabet. But there are a number of annoyingly
similar encodings around, where a large number of the mappings are the
same, but you're left with just a few ambiguous bytes.
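
A sketch of that heuristic, under my own assumptions: the helper
names are hypothetical, and taking the first word of the Unicode
character name as a stand-in for "script" is a crude shortcut. It
flags words that mix alphabets, then shows how few bytes separate two
of the similar encodings (Latin-1 and ISO-8859-15 are just one such
pair).

```python
import unicodedata

def scripts_in(word):
    """First word of each letter's Unicode name, e.g. LATIN or GREEK."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in word if ch.isalpha()}

def mixed_script_words(text):
    """Words whose letters come from more than one script (suspicious)."""
    return [w for w in text.split() if len(scripts_in(w)) > 1]

# The lyric from above, stored as Latin-1 bytes.
raw = "Tuuleen jäiseen mä nostan pään".encode("latin-1")

for codec in ("iso8859_7", "latin-1"):
    text = raw.decode(codec)
    print(codec, repr(text), mixed_script_words(text))

# Bytes on which Latin-1 and ISO-8859-15 disagree; all others match.
differing = [b for b in range(256)
             if bytes([b]).decode("latin-1") != bytes([b]).decode("iso8859_15")]
print([hex(b) for b in differing])
```

The Greek decode gets flagged and the Latin-1 one doesn't, but a file
that never uses one of the handful of differing bytes gives you
nothing to tell the two Latin encodings apart.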

And if you're considering non-ASCII-compatible encodings, things get a
lot harder. UTF-16 can represent large slabs of Chinese text using the
same bytes that would represent alphanumeric characters; so how can
you distinguish it from base-64?
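
For instance (the payload here is an arbitrary example of mine), a
base-64 string with an even number of bytes and no padding decodes
cleanly as UTF-16 and comes out looking like East Asian text:

```python
import base64

# Arbitrary sample payload: 12 bytes in, 16 base-64 characters out.
payload = base64.b64encode(b"Hello world!")

# Reading the same bytes as little-endian UTF-16 also works.
as_text = payload.decode("utf-16-le")

print(payload)
print(as_text)
print([f"U+{ord(ch):04X}" for ch in as_text])

# Both interpretations round-trip, so the bytes alone cannot settle it.
round_trip = as_text.encode("utf-16-le")
original = base64.b64decode(payload)
```

Without knowing where the data came from, either reading is
defensible.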

I have encountered MANY files where I couldn't figure out the
encoding. Some of them were quite possibly in ancient encodings (some
had CR line endings), some were ambiguous, and on multiple occasions,
I've had to deal with files that had more than one encoding in the
same block of content. (Or more frequently, not files but socket
connections. Same difference.) So no, you cannot always figure out a
file's encoding from its contents. Because that will, on some
occasions, violate the laws of physics - granted, that's merely a
misdemeanour in some states.
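
A small sketch of the mixed-encoding case. The sample data is
invented, and the "try UTF-8, then fall back" rule is one common
workaround, not a way of recovering the true encoding:

```python
def decode_line(raw_line, fallback="latin-1"):
    """Try UTF-8 first; on failure, guess the fallback (an assumption)."""
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        return raw_line.decode(fallback)

# Hypothetical stream: one line in UTF-8, the next in Latin-1.
stream = "naïve".encode("utf-8") + b"\n" + "café".encode("latin-1")

# No single strict UTF-8 decode of the whole block can succeed.
try:
    stream.decode("utf-8")
    whole_ok = True
except UnicodeDecodeError:
    whole_ok = False

lines = [decode_line(part) for part in stream.split(b"\n")]
print(whole_ok, lines)
```

The fallback never fails because Latin-1 accepts every byte, which is
exactly why its output is a guess and not a detection.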

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
