On 17.03.16 19:23, M.-A. Lemburg wrote:
> On 17.03.2016 15:02, Serhiy Storchaka wrote:
>> On 17.03.16 15:14, M.-A. Lemburg wrote:
>>> On 17.03.2016 01:29, Guido van Rossum wrote:
>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>
>>> I'd prefer a separate utility for this somewhere, since
>>> tokenize.detect_encoding() is not available in Python 2.
>>>
>>> I've attached an example implementation with tests, which works
>>> in Python 2.7 and 3.
>>
>> Sorry, but this code doesn't match the behaviour of the Python interpreter,
>> nor of other tools. I suggest backporting tokenize.detect_encoding() (but
>> be aware that the default encoding in Python 2 is ASCII, not UTF-8).
>
> Yes, I got the default for Python 3 wrong. I'll fix that. Thanks
> for the note.
>
> What other aspects differ from what Python implements?

1. If there is both a BOM and a coding cookie, the source encoding is
"utf-8-sig".
2. If there is a BOM and the coding cookie is not 'utf-8', this is an error.
3. If the first line is not a blank line or a comment line, the coding
cookie is not searched for in the second line.
4. The encoding name should be canonicalized. "UTF8", "utf8", "utf_8" and
"utf-8" are the same encoding (and all are changed to "utf-8-sig" with a BOM).
5. There is no limit of 400 bytes. Actually there is a bug with
handling long lines in the current code, but even with this bug the limit is
6. I made a mistake in the regular expression: I missed the underscore.
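
Points 1-4 can be checked directly against tokenize.detect_encoding() on
Python 3. What follows is only a sketch: the helper name detect() and the
sample byte strings are made up for illustration, and point 4 is checked
only for the spellings "utf_8" and "UTF-8", with no claim about others.

```python
import codecs
import io
import tokenize


def detect(source_bytes):
    """Return the encoding name reported by tokenize.detect_encoding()."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


BOM = codecs.BOM_UTF8

# No BOM and no cookie: Python 3 falls back to UTF-8.
no_cookie = detect(b"x = 1\n")

# Point 1: a BOM together with a utf-8 cookie is reported as "utf-8-sig".
bom_and_cookie = detect(BOM + b"# -*- coding: utf-8 -*-\n")

# Point 2: a BOM together with a non-UTF-8 cookie is an error.
try:
    detect(BOM + b"# -*- coding: latin-1 -*-\n")
    bom_conflict_is_error = False
except SyntaxError:
    bom_conflict_is_error = True

# Point 3: the first line is code, so a cookie on the second line is ignored.
cookie_after_code = detect(b"x = 1\n# -*- coding: latin-1 -*-\n")

# For comparison: the second line is searched when the first is a comment.
cookie_after_comment = detect(
    b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"
)

# Point 4 (partial check): these two spellings come back as one name.
spellings = {
    detect(b"# coding: " + name + b"\n") for name in (b"utf_8", b"UTF-8")
}

print(no_cookie, bom_and_cookie, bom_conflict_is_error)
print(cookie_after_code, cookie_after_comment, spellings)
```

The input is wrapped in io.BytesIO because detect_encoding() expects a
readline callable that returns bytes, and it reads at most two lines.
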
tokenize.detect_encoding() is the closest imitation of the behavior of
the Python interpreter.
Python-Dev mailing list