On 2010-08-14, Martin v. Loewis <mar...@v.loewis.de> wrote:
>> Is there a standard way to autodetect the encoding of a text file?
>
> Use the chardet module:
> http://chardet.feedparser.org/
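
For anyone following along, here is a minimal sketch of what that might
look like in practice. It assumes the third-party chardet package is
installed and that chardet.detect() returns a dict with "encoding" and
"confidence" keys. The helper names detect_encoding and decode_guessing,
the detector parameter and the latin-1 fallback are my own illustrative
choices, not part of chardet, and the code is written for Python 3.

```python
def detect_encoding(raw, detector=None):
    """Return (encoding, confidence) guessed for the bytes in `raw`.

    `detector` is a hypothetical hook: any callable that takes bytes and
    returns a dict shaped like chardet.detect()'s result. When omitted,
    chardet itself is imported and used.
    """
    if detector is None:
        import chardet  # third-party; assumed to be installed
        detector = chardet.detect
    guess = detector(raw)
    return guess.get("encoding"), guess.get("confidence", 0.0)


def decode_guessing(raw, detector=None, fallback="latin-1"):
    """Decode `raw` using the guessed encoding.

    The guess is only a heuristic, so undecodable bytes are replaced
    rather than raising. If no encoding is guessed at all, `fallback`
    (an arbitrary choice here) is used instead.
    """
    encoding, _confidence = detect_encoding(raw, detector)
    return raw.decode(encoding or fallback, errors="replace")


# Typical use: read the file as bytes, then decode.
#
#     with open("some_file.txt", "rb") as f:
#         text = decode_guessing(f.read())
```

The detection is statistical, so it is worth checking the confidence
value before trusting the result, especially on short inputs.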

Very timely: the python-chardet package seems to have just appeared
on Debian squeeze :-)  After my latest "aptitude safe-upgrade":

box8 (debian) ~> aptitude show python-chardet
Package: python-chardet
State: installed
Automatically installed: yes
Version: 2.0.1-1
Priority: optional
Section: python
Maintainer: Piotr Ożarowski <pi...@debian.org>
Uncompressed Size: 721k
Depends: python, python-support (>= 0.90.0)
Description: universal character encoding detector
 Chardet takes a sequence of bytes in an unknown character encoding, and
 attempts to determine the encoding.

 Supported encodings:
  * ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  * Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and
    Simplified Chinese)
  * EUC-JP, SHIFT_JIS, ISO-2022-JP (Japanese)
  * EUC-KR, ISO-2022-KR (Korean)
  * KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  * ISO-8859-2, windows-1250 (Hungarian)
  * ISO-8859-5, windows-1251 (Bulgarian)
  * windows-1252 (English)
  * ISO-8859-7, windows-1253 (Greek)
  * ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  * TIS-620 (Thai)

 This library is a port of the auto-detection code in Mozilla.
Homepage: http://chardet.feedparser.org/

Regards,  Peter

--
Peter Billam       www.pjb.com.au    www.pjb.com.au/comp/contact.html