[issue44349] Edge case in pegen's error displaying with non-utf8 lines

Ammar Askar Tue, 08 Jun 2021 10:50:43 -0700

New submission from Ammar Askar:

The AST currently stores column offsets as byte offsets. However, when 
displaying errors, these byte offsets must be converted into character 
offsets so that the carets line up with the characters on the printed line. 
This is done by the function `byte_offset_to_character_offset` 
(https://github.com/python/cpython/blob/fdc7e52f5f1853e350407c472ae031339ac7f60c/Parser/pegen.c#L142-L161),
which assumes that the line is UTF-8 encoded.
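
To illustrate the conversion, here is a rough Python analogue of what that function does. It is a sketch rather than a translation of the C code: the Python signature is hypothetical, and the clamping of out-of-range offsets is an assumption made for the example.

```python
def byte_offset_to_character_offset(line: str, col_offset: int) -> int:
    """Convert a UTF-8 byte offset within `line` to a character offset.

    Illustrative analogue only; it assumes `col_offset` counts bytes of
    the UTF-8 encoding of `line`.
    """
    data = line.encode("utf-8")
    # Assumption: clamp offsets that fall outside the line.
    col_offset = max(0, min(col_offset, len(data)))
    # errors="replace" avoids an exception if the offset splits a
    # multi-byte character.
    return len(data[:col_offset].decode("utf-8", errors="replace"))


# "'¢¢" is 5 bytes in UTF-8 (1 + 2 + 2) but only 3 characters.
print(byte_offset_to_character_offset("'¢¢' + x", 5))  # 3
```

The result is only meaningful when the byte offset was computed against the same UTF-8 bytes as the line passed in.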


However, consider a file like this:

  '┬ó┬ó┬ó┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError

This prints

  File "test-normal.py", line 1
    '┬ó┬ó┬ó┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
                          ^^^^^^^^^^^^^^^^^^^^^^
  SyntaxError: Generator expression must be parenthesized

as expected.


However, if we add a custom source encoding declaration:

  # -*- coding: cp437 -*-
  '┬ó┬ó┬ó┬ó┬ó┬ó' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError

it ends up printing out

  File "C:\Users\ammar\junk\test-utf16.py", line 2
    '¢¢¢¢¢¢' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError
                                      ^^^^^^^^^^^^^^^^^^^^^^
  SyntaxError: Generator expression must be parenthesized

where the carets are misaligned with the actual characters. This is because 
the string "┬ó" has a display width of 2 characters and encodes to 2 bytes in 
cp437, but those same 2 bytes, when interpreted as UTF-8, are the single 
character "¢" with a display width of 1.
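
The byte-level mismatch can be checked from Python. The sketch below rests on an assumption: that the stored column offset counts bytes of the UTF-8 re-encoding of the cp437-decoded line, and that this offset is then applied to the same raw bytes decoded as UTF-8 for display. All variable names are illustrative.

```python
# Each "┬ó" pair is stored in the cp437 file as these two bytes.
raw_pair = "┬ó".encode("cp437")
print(raw_pair)                  # b'\xc2\xa2'
print(raw_pair.decode("utf-8"))  # ¢

# Raw bytes of the offending source line as stored on disk.
line_bytes = (
    b"'" + raw_pair * 6
    + b"' + f(4, 'Hi' for x in range(1)) # This line has a SyntaxError"
)
parsed_line = line_bytes.decode("cp437")  # text under the declared encoding
shown_line = line_bytes.decode("utf-8")   # text if the bytes are read as UTF-8

# Assumed byte offset of 'Hi' in the UTF-8 re-encoding of the parsed line.
col_offset = len(parsed_line[: parsed_line.index("'Hi'")].encode("utf-8"))

# Applying that byte offset to the UTF-8 bytes of the displayed line.
misplaced = len(
    shown_line.encode("utf-8")[:col_offset].decode("utf-8", errors="replace")
)
expected = shown_line.index("'Hi'")

print(col_offset, misplaced, expected)  # 40 34 16
```

Under these assumptions the caret lands 18 columns to the right of where `'Hi'` actually starts in the displayed line, which is in line with the rightward shift seen in the second traceback.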

Note that this edge case is relatively hard to trigger. Ordinarily, the call 
to PyErr_ProgramTextObject fails because it tries to decode the line as UTF-8: 
https://github.com/python/cpython/blob/ae3c66acb89a6104fcd0eea760f80a0287327cc4/Python/errors.c#L1693-L1696
The error handling logic then falls back to the tokenizer's internal buffer, 
which holds a proper UTF-8 string. So this bug requires the input to be valid 
in both UTF-8 and the source encoding.
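
As a rough way to tell whether a given line can hit this path, one can check that its raw bytes decode under both encodings but to different text. The helper name `can_trigger_misalignment` is hypothetical and is not part of CPython.

```python
def can_trigger_misalignment(line_bytes: bytes, source_encoding: str) -> bool:
    """Return True if the bytes decode under both UTF-8 and the declared
    source encoding, and the two decodings differ."""
    try:
        as_utf8 = line_bytes.decode("utf-8")
        as_source = line_bytes.decode(source_encoding)
    except UnicodeDecodeError:
        # Not valid in one of the two encodings, so the precondition
        # described above does not hold.
        return False
    return as_utf8 != as_source


# b"\xc2\xa2" is "┬ó" in cp437 and "¢" in UTF-8.
print(can_trigger_misalignment(b"'\xc2\xa2'", "cp437"))  # True
# A lone 0x9b byte is "¢" in cp437 but is not valid UTF-8.
print(can_trigger_misalignment(b"'\x9b'", "cp437"))      # False
# Pure ASCII decodes identically under both encodings.
print(can_trigger_misalignment(b"x = 1", "cp437"))       # False
```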

(Discovered while implementing PEP 657 
https://github.com/colnotab/cpython/issues/10)

----------
components: Parser
messages: 395347
nosy: ammar2, lys.nikolaou, pablogsal
priority: normal
severity: normal
status: open
title: Edge case in pegen's error displaying with non-utf8 lines
versions: Python 3.10, Python 3.11, Python 3.9

_______________________________________
Python tracker
<https://bugs.python.org/issue44349>
_______________________________________