On 2018-05-23 07:38:27 +1000, Chris Angelico wrote:
> On Wed, May 23, 2018 at 7:23 AM, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> >> The best you can do is to go ask the canonical source of the
> >> file what encoding the file is _supposed_ to be in.
> >
> > I disagree on both counts.
> >
> > 1) For any given file it is almost always possible to find the correct
> >    encoding (or *a* correct encoding, as there may be more than one).
> 
> You can find an encoding which is capable of decoding a file. That's
> not the same thing.

If the result is correct, it is the same thing.

If I have an input file 

    4c 69 65 62 65 20 47 72 fc df 65 0a

and I decode it correctly to

    Liebe Grüße

it doesn't matter whether I used ISO-8859-1 or ISO-8859-2. The mapping
for all bytes in the input file is the same in both encodings.


> >    This may require domain-specific knowledge (e.g. it may be necessary
> >    to recognize the human language and know at least some distinctive
> >    words, or to know some special symbols likely to be used in a data
> >    file), and it almost always takes a bit of detective work and trial
> >    and error. But I don't think I ever encountered a file where I
> >    couldn't figure out the encoding.
> 
> Look up the old classic "bush hid the facts" hack with Windows
> Notepad. A pure ASCII file that got misdetected based on the byte
> patterns in it.

And would you have made the same mistake as notepad? Nope, I'm quite
sure that you are able to recognize an ASCII file with an English
sentence as ASCII. You wouldn't even consider that it could be UTF-16LE.


> If you restrict yourself to ASCII-compatible eight-bit encodings, you
> MAY be able to figure out what something is.
[...]
> But there are a number of annoyingly similar encodings around, where a
> large number of the mappings are the same, but you're left with just a
> few ambiguous bytes.

They are rarely ambiguous if you speak the language.

> And if you're considering non-ASCII-compatible encodings, things get a
> lot harder. UTF-16 can represent large slabs of Chinese text using the
> same bytes that would represent alphanumeric characters; so how can
> you distinguish it from base-64?

I'll ask my Chinese colleague to read it. If he can read it, it's almost
certainly Chinese and not base-64.

As I said, domain knowledge may be necessary. If you are decoding a file
which may contain a Chinese text, you may have to know Chinese to check
whether the decoded text makes sense.

If your job is to figure out the encoding of files which you don't
understand (and hence can't check whether your results are correct) I
will concede that this is impossible.

> I have encountered MANY files where I couldn't figure out the
> encoding. Some of them were quite possibly in ancient encodings (some
> had CR line endings), some were ambiguous, and on multiple occasions,
> I've had to deal with files that had more than one encoding in the
> same block of content.

Well, files with multiple encodings break the assumption that there is
*one* correct encoding. While I have encountered such files, too (as
well as multi-encodings and random errors), I don't think we were
talking about that.

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Attachment: signature.asc
Description: PGP signature

-- 
https://mail.python.org/mailman/listinfo/python-list

Reply via email to