[issue43950] Include column offsets for bytecode instructions

Ammar Askar Mon, 19 Jul 2021 11:24:25 -0700

Ammar Askar <[email protected]> added the comment:

Had some time to look into this. To summarize the problem: it deals with 
Unicode code points that count as a single character but are displayed wider 
than a single character cell, even in a monospace font [1].


In the examples above, the Chinese character counts as one character in a 
Python string. However, notice that it needs two carets:

>>> x = "该"
>>> print(x)
该
>>> len(x)
1
>>> print(x + '\n' + '^^')
该
^^

This issue is somewhat font dependent. In the case of the emoji, I know that 
Windows sometimes renders emojis as single-character-wide black-and-white 
glyphs and sometimes as colorful ones, depending on the program.

As Pablo alluded to, unicodedata.east_asian_width is probably the best solution 
we can implement. For these wide characters it provides:

>>> unicodedata.east_asian_width('💩')
'W'
>>> unicodedata.east_asian_width('该')
'W'

W corresponds to Wide. For regular-width characters, on the other hand:

>>> unicodedata.east_asian_width('b')
'Na'
>>> unicodedata.east_asian_width('=')
'Na'

we get Na, i.e. Narrow (Neutral would be reported as 'N'). This can be used to 
count the "displayed width" of the characters and hence the number of carets. 
However, organizing the code is going to be a bit tricky, since we're 
currently using _PyPegen_byte_offset_to_character_offset to get offsets for 
string slicing in the AST segment parsing code. We might have to add a 
separate function that computes the display width.
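
To make the idea concrete, here is a rough Python sketch of what such a 
display-width helper could look like. The names display_width and caret_line 
are hypothetical and are not part of CPython. The sketch assumes that 'W' and 
'F' characters take two columns and everything else takes one, so it does not 
handle combining characters or the font-dependent cases mentioned above:

```python
import unicodedata


def display_width(text):
    """Approximate how many terminal columns `text` occupies (hypothetical helper)."""
    width = 0
    for ch in text:
        # Assumption: Wide ('W') and Fullwidth ('F') characters take two
        # columns; every other category is treated as one column.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


def caret_line(line, start, end):
    """Build a caret line underlining line[start:end] (character offsets)."""
    padding = " " * display_width(line[:start])
    carets = "^" * display_width(line[start:end])
    return padding + carets


source = 'x = "该" + y'
print(source)
print(caret_line(source, 4, 7))  # underlines the string literal "该"
```

The character offsets are still used for slicing, and only the padding and 
the number of carets are derived from the display width.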

-------------

[1] Way more details on this issue here: 
https://denisbider.blogspot.com/2015/09/when-monospace-fonts-arent-unicode.html 
and an example of a Python library that tries to deal with this issue here: 
https://github.com/jquast/wcwidth

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue43950>
_______________________________________