On 2018-05-29 20:28:54 +1000, Chris Angelico wrote:
> On Tue, May 29, 2018 at 8:09 PM, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> > On 2018-05-29 19:46:24 +1000, Chris Angelico wrote:
> >> That's basically what the chardet module does, and its error rate is
> >> far FAR higher than that. If you think it's easy to detect encodings,
> >> I'm sure the chardet maintainers will be happy to accept pull
> >> requests!
> >
> > We were talking about humans, not programs.
> >
> 
> Sure, but you're describing a set of rules. If you can define a set of
> rules that pin down the encoding, you could teach chardet to follow
> those rules. If you can't teach chardet to follow those rules, you
> can't teach a human to follow them either. What is the human going to
> do? Guess?

xkcd to the rescue:

https://xkcd.com/1425/

There are a lot of things which are easy to do for a human (recognize a
bird, understand a sentence), but very hard to write a program for
(mostly because we don't understand how our brain works, I think).

I haven't looked in detail at how chardet works, but it looks like it has
a few simple statistical models for the probability of characters and
character sequences. This is very different from what a human does, who
a) recognises whole words, and b) knows what they mean and whether they
make sense in context.
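
To make "simple statistical model" concrete, here is a minimal sketch of
scoring candidate encodings by character probabilities. It is not
chardet's actual code; the function names, the tiny training sample and
the candidate list are all assumptions made up for illustration.

```python
import math
from collections import Counter


def train_unigram(sample):
    """Estimate per-character probabilities from a sample text."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def mean_log_prob(text, model, floor=1e-6):
    """Average log-probability per character; unseen characters get `floor`."""
    if not text:
        return float("-inf")
    return sum(math.log(model.get(ch, floor)) for ch in text) / len(text)


def guess_by_char_stats(data, candidates, model):
    """Return (encoding, score) for the best-scoring candidate, or None."""
    best = None
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        score = mean_log_prob(text, model)
        if best is None or score > best[1]:
            best = (enc, score)
    return best


# Hypothetical training text; a real model would need far more data.
model = train_unigram(
    "Die Größe der Bäume über dem Fluß ist schön und für viele Vögel wichtig"
)
data = "Schöne Grüße".encode("latin-1")
print(guess_by_char_stats(data, ["utf-8", "cp437", "latin-1"], model))
```

Note that the model only knows which characters are likely, not which
words exist or whether the sentence makes sense.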

For a sufficiently narrow range of texts, you can write a program which
is better at recognizing encoding or language than any human (as an
obvious improvement to chardet, you could supply it with dictionaries of
all languages). However, in the general case that would need a strong
AI. And we aren't there yet, by far.
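
The dictionary idea can be sketched roughly as follows: decode the bytes
with each candidate encoding and prefer the one that yields the most
known words. The word list, the candidate encodings and the function
names are hypothetical, and this covers only the "recognises whole
words" part, not meaning or context.

```python
import re


def dictionary_score(text, words):
    """Fraction of alphabetic tokens in `text` that appear in `words`."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in words for token in tokens) / len(tokens)


def guess_by_dictionary(data, candidates, words):
    """Return (encoding, score) for the candidate with the most known words."""
    best = None
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        score = dictionary_score(text, words)
        if best is None or score > best[1]:
            best = (enc, score)
    return best


# Hypothetical four-word "dictionary"; a real one would be per language.
words = {"schöne", "grüße", "aus", "wien"}
data = "Schöne Grüße aus Wien".encode("latin-1")
print(guess_by_dictionary(data, ["utf-8", "cp437", "latin-1"], words))
```

This works when the text happens to be in a language you have a
dictionary for, which is exactly the "sufficiently narrow range" caveat.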

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
