On Tue, May 29, 2018 at 8:09 PM, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> On 2018-05-29 19:46:24 +1000, Chris Angelico wrote:
>> On Tue, May 29, 2018 at 6:15 PM, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
>> > So if the text is German it will contain more words with
>> > umlauts and each byte which is part of a correctly spelled German word
>> > when interpreted according to ISO-8859-1 increases the probability that
>> > decoding with ISO-8859-1 will produce the correct result. There remains
>> > a tiny probability that all those matches are mere coincidence, but I
>> > wrote "almost always", not "always", so I can live with an error rate of
>> > 0.000001% (or something like that).
>>
>> That's basically what the chardet module does, and its error rate is
>> far FAR higher than that. If you think it's easy to detect encodings,
>> I'm sure the chardet maintainers will be happy to accept pull
>> requests!
>
> We were talking about humans, not programs.
>

Sure, but you're describing a set of rules. If you can define a set of
rules that pin down the encoding, you could teach chardet to follow
those rules. If you can't teach chardet to follow those rules, you
can't teach a human to follow them either. What is the human going to
do? Guess?

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
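
For concreteness, the kind of rule discussed above (score each candidate decoding by how many correctly spelled words it yields) might be sketched roughly as below. This is only an illustration and not a description of how chardet works internally; the tiny word list, the function names score and guess_encoding, and the candidate list are all made up for the example.

```python
import re

# Hypothetical mini word list; a real one would be a full dictionary.
GERMAN_WORDS = {"grüße", "aus", "münchen"}

def score(data, encoding, words=GERMAN_WORDS):
    """Count decoded words found in the word list; -1 if decoding fails."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1
    return sum(1 for w in re.findall(r"\w+", text.lower()) if w in words)

def guess_encoding(data, candidates):
    """Return the candidate with the highest score (first one wins ties)."""
    return max(candidates, key=lambda enc: score(data, enc))

# Sample bytes whose true encoding is known to be ISO-8859-1.
sample = "Grüße aus München".encode("iso-8859-1")
print(guess_encoding(sample, ["utf-8", "iso-8859-1", "cp437"]))
```

Note that a rule like this cannot separate ISO-8859-1 from ISO-8859-2 on this sample, since ü and ß sit at the same byte values in both, so the word count ties and the answer depends on candidate order.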