
Eryk Sun Tue, 13 Apr 2021 02:38:33 -0700


Eryk Sun <eryk...@gmail.com> added the comment:


> P.S. No problems with Python 3.8.5 and Ubuntu 20.04.2 LTS.

The issue is that the tokenizer reads at most BUFSIZ bytes of a line at a 
time, which ends up splitting the UTF-8 sequence b'\xe2\x96\x91' across two 
reads. BUFSIZ is only 512 bytes on Windows. It's 8192 bytes on Linux, in which 
case you need a line that's 16 times longer in order to reproduce the error. 
For example:

    $ stat -c "%s" test.py 
    8194
    $ python3.9 test.py
    SyntaxError: Non-UTF-8 code starting with '\xe2' in file 
    /home/someone/test.py on line 1, but no encoding declared; see 
    http://python.org/dev/peps/pep-0263/ for details
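
The splitting itself can be illustrated in pure Python. This is only a sketch 
of the failure mode, not the tokenizer's C code: the helper names split_at and 
decode_chunks are made up for illustration, and the exact split offset in the 
real reader may differ by a byte. It assumes the Windows value of BUFSIZ and 
pads a line so that b'\xe2\x96\x91' straddles the boundary.

```python
BUFSIZ = 512  # Windows value quoted above; 8192 on Linux

# Hypothetical one-line source file: 5 bytes of prefix plus ASCII padding,
# so the 3-byte sequence for U+2591 starts on the last byte of the first chunk.
line = ("x = '" + "a" * (BUFSIZ - 6) + "\u2591'\n").encode("utf-8")


def split_at(data, size):
    """Cut data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def decode_chunks(data, size):
    """Decode each chunk on its own, as a size-limited reader would.

    Raises UnicodeDecodeError if a multi-byte sequence is cut in two.
    """
    return "".join(chunk.decode("utf-8") for chunk in split_at(data, size))


if __name__ == "__main__":
    print(split_at(line, BUFSIZ)[0][-1:])  # the lone lead byte of the sequence
    line.decode("utf-8")                   # decoding the whole line is fine
    try:
        decode_chunks(line, BUFSIZ)
    except UnicodeDecodeError as exc:
        print("chunked decode failed:", exc)
```

The whole line is valid UTF-8; only the chunk that ends in the lone lead byte 
fails to decode.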

This has been fixed in a rewrite of the tokenizer (bpo-25643), for which the PR 
was recently merged into the main branch for 3.10a7+.

Maybe a minimal backport to keep reading up to "\n" can be applied to 3.8 and 
3.9.
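
The idea of "keep reading up to "\n"" might look roughly like this in Python. 
It is a sketch of the approach only, not the actual patch: read_full_line is a 
hypothetical helper, and io.BytesIO stands in for the source file.

```python
import io


def read_full_line(stream, bufsize):
    """Collect size-limited reads until the line is complete.

    Each readline() call returns at most `bufsize` bytes, so a long line
    arrives in pieces. Decoding is left to the caller and happens only
    after the newline (or EOF) has been reached.
    """
    parts = []
    while True:
        chunk = stream.readline(bufsize)
        if not chunk:              # EOF
            break
        parts.append(chunk)
        if chunk.endswith(b"\n"):  # line is complete
            break
    return b"".join(parts)


if __name__ == "__main__":
    # Long first line with a multi-byte character past the 512-byte mark.
    source = ("x = '" + "a" * 506 + "\u2591'\n" + "y = 1\n").encode("utf-8")
    stream = io.BytesIO(source)
    first = read_full_line(stream, 512).decode("utf-8")
    second = read_full_line(stream, 512).decode("utf-8")
    print(len(first), repr(second))
```

Because the pieces are joined before decoding, a multi-byte sequence that 
crosses a read boundary is reassembled first.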

----------
nosy: +eryksun

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38755>
_______________________________________

[issue38755] Long unicode string causes SyntaxError: Non-UTF-8 code starting with '\xe2' in file ..., but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
