There is a very old *problem*: detecting the encoding of a text file. It
has only been partially solved by programs like Chardet
<https://github.com/chardet/chardet>. I did not like the idea of one
prober per encoding table, which can lead to hard-coded, per-encoding
specifications.

I wanted to challenge the existing methods of discovering the originating
encoding.

You could consider this issue obsolete because of current norms:

You should declare the charset encoding you use, as the standards describe.

But the reality is different: a huge part of the internet still serves
content with an unknown encoding. (*SubRip subtitles (SRT), for
instance.*)

This is why a popular package like Requests
<https://github.com/psf/Requests> embeds Chardet to guess the apparent
encoding of remote resources.
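
For context, this is how that guess surfaces in practice. A minimal
sketch (the URL is illustrative, not from this announcement): Requests
exposes the header-declared encoding and the detector-based guess side
by side.

    import requests

    response = requests.get("https://example.com")

    # Encoding declared by the server in the Content-Type header, if any.
    print(response.encoding)

    # Encoding guessed from the raw response bytes by the embedded
    # detector (Chardet, at the time of this announcement).
    print(response.apparent_encoding)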

https://github.com/Ousret/charset_normalizer

https://pypi.org/project/charset-normalizer

Charset Normalizer <https://github.com/Ousret/charset_normalizer>. *A first
PoC, currently at version 0.3.*

The Real First Universal Charset Detector. No Cpp Bindings. (13-Sept-19)
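
As a taste of the intended usage, here is a minimal sketch based on the
from_bytes entry point of later charset-normalizer releases; the 0.3 PoC
announced here may expose a slightly different interface, so take the
exact names below as assumptions rather than the definitive API.

    from charset_normalizer import from_bytes

    # Bytes in some legacy single-byte encoding; the string is illustrative.
    payload = "Comment ça va ?".encode("cp1252")

    results = from_bytes(payload)   # probe the bytes against candidate encodings
    best = results.best()           # best-ranked match, or None if nothing plausible

    if best is not None:
        print(best.encoding)        # detected encoding label (a cp1252-compatible charset here)
        print(str(best))            # the payload decoded with that encoding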

*MIT License*

ahmed.ta...@cloudnursery.dev