STINNER Victor <victor.stin...@gmail.com> added the comment:

> Function decoding_fgets (Parser/tokenizer.c) reads line in buffer
> of fixed size 8192 (line truncated to size 8191) and then fails
> because line is cut in the middle of a multibyte UTF-8 character.

It looks like BUFSIZ is much smaller than 8192 on Windows: it may be only 1024
bytes.
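
To illustrate the failure mode, here is a minimal Python sketch (not the
tokenizer code itself). BUF_SIZE is a hypothetical, deliberately tiny value
chosen for readability, and the slice only mimics how fgets() fills a
fixed-size buffer. The real tokenizer reports a SyntaxError, whereas this
sketch gets a UnicodeDecodeError, so it is only an analogy.

```python
# Hypothetical buffer size, tiny on purpose; the report talks about 8192
# and possibly 1024 on Windows.
BUF_SIZE = 16

# A source line made of 2 ASCII bytes followed by 2-byte UTF-8 characters.
line = ("# " + "\u00e9" * 20 + "\n").encode("utf-8")

# fgets()-like behaviour: keep at most BUF_SIZE - 1 bytes of the line.
chunk = line[:BUF_SIZE - 1]

# The cut falls between the two bytes of a character, so decoding fails.
try:
    chunk.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(repr(chunk[-1:]), error)
```

The last byte kept is the lead byte of a 2-byte sequence, which is why the
chunk alone is not valid UTF-8 even though the full line is.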

The attached patch detects when a line is truncated (i.e. longer than the
internal buffer).
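
One way such a check could work, sketched in Python under assumptions: the
helper names fgets_like() and is_truncated() are hypothetical, and this is
not necessarily what detect_truncate.patch does. The idea is that a read
which filled the buffer without reaching a newline was probably cut short.

```python
import io


def fgets_like(stream, size):
    """Read at most size - 1 bytes, stopping after a newline (like C fgets)."""
    return stream.readline(size - 1)


def is_truncated(chunk, size):
    """Guess whether chunk is only the beginning of a longer line.

    The buffer is full and no newline was seen. Note the false positive:
    a final line of exactly size - 1 bytes with no trailing newline.
    """
    return len(chunk) == size - 1 and not chunk.endswith(b"\n")


stream = io.BytesIO(b"short\n" + b"x" * 20 + b"\n")
first = fgets_like(stream, 8)
second = fgets_like(stream, 8)
print(is_truncated(first, 8), is_truncated(second, 8))
```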

A better solution may be to reallocate the buffer if the line is longer
than the buffer (write a universal fgets which grows the buffer while the
line is read). Most functions parsing Python source code use a dynamic buffer.
For example, "import module" now reads the whole file content before parsing it
(see FileLoader.get_data() in Lib/importlib/_bootstrap.py).
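
A rough Python sketch of the "grow the buffer while reading" idea follows.
The function name read_full_line() and the chunk_size default are
assumptions made for illustration; a real fix would be written in C with
realloc() in Parser/tokenizer.c.

```python
import io


def read_full_line(stream, chunk_size=8):
    """Read one complete line, however long, using fixed-size reads."""
    buf = bytearray()
    while True:
        chunk = stream.readline(chunk_size)  # at most chunk_size bytes
        if not chunk:                        # end of file
            break
        buf += chunk                         # the buffer grows as needed
        if chunk.endswith(b"\n"):            # reached the end of the line
            break
    return bytes(buf)


long_line = ("# " + "\u00e9" * 50 + "\n").encode("utf-8")
stream = io.BytesIO(long_line + b"x = 1\n")
line1 = read_full_line(stream)
line2 = read_full_line(stream)
line3 = read_full_line(stream)
print(len(line1), line2, line3)
```

Since decoding only happens once the whole line has been collected, a chunk
boundary falling inside a multibyte character no longer matters.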

At the very least, we should use a longer buffer on Windows (e.g. use 8192 on
all platforms?).

I only found two functions parsing a Python file line by line:
PyRun_InteractiveOneFlags() and PyRun_FileExFlags(). There are many variants of
these functions (e.g. PyRun_InteractiveOne and PyRun_File). These functions are
part of the Python C API and are used by programs to execute Python code when
Python is embedded in a program.

PS: As noticed by Serhiy Storchaka, the bug is not specific to Windows. It's 
just that the internal buffer is much smaller on Windows.

----------
components: +Interpreter Core -Windows
keywords: +patch
nosy: +haypo
title: Syntax error on long UTF-8 lines -> decoding_fgets() truncates long 
lines and fails with a SyntaxError("Non-UTF-8 code starting with...")
Added file: http://bugs.python.org/file25605/detect_truncate.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue14811>
_______________________________________