[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

Andrew Barnert via Python-ideas Sun, 27 Oct 2019 21:43:38 -0700

On Oct 27, 2019, at 18:00, Steven D'Aprano <st...@pearwood.info> wrote:
> 
> On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas 
> wrote:
> 
>>> File "/home/rosuav/tmp/demo.py", line 1
>>>   print("Hello, world!')
>>>                        ^
>>> SyntaxError: EOL while scanning string literal
>> 
>> So if those 12 glyphs take 14 code units 
> 
> I'm not really sure how glyphs (the graphical representation of a 
> character) comes into this discussion


Because, assuming you’re using a monospace font, the number of glyphs up to the 
error is how many spaces you need to indent.

This example happens to be pure ASCII, so the count of glyphs, extended 
grapheme clusters, code units, and code points happens to be the same. But just 
change that e to an è made of two code points, a base letter plus a combining 
mark (like the ç in your previous example might have been), and now there are 
still the same number of glyphs and clusters, but one more code point and one 
more code unit.

Extended grapheme clusters are intended to be the best approximation of 
“characters” in the Unicode standard. Code units are not.
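
To make those counts concrete, here is a minimal Python sketch. The helper 
names approx_clusters and counts are hypothetical, and approx_clusters only 
approximates extended grapheme clusters by folding each combining mark into the 
preceding character; the real segmentation rules live in UAX #29, which the 
standard library does not implement.

```python
import unicodedata

def approx_clusters(s):
    """Rough cluster count: code points that are not combining marks.

    Assumption: this ignores most of UAX #29 (emoji sequences, Hangul
    jamo, flags, ...) and is only good enough for simple Latin text.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

def counts(s):
    """Return (approx clusters, code points, UTF-16 units, UTF-8 bytes)."""
    return (
        approx_clusters(s),
        len(s),                           # len() of a str counts code points
        len(s.encode("utf-16-le")) // 2,  # UTF-16 code units
        len(s.encode("utf-8")),           # UTF-8 code units (bytes)
    )

composed = unicodedata.normalize("NFC", "H\u00e8llo")
decomposed = unicodedata.normalize("NFD", "H\u00e8llo")

print(counts(composed))    # (5, 5, 5, 6)
print(counts(decomposed))  # (5, 6, 6, 7)
```

Both strings draw the same five glyphs, but every count other than the 
cluster count changes with the normalization form.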

> but for what it's worth, I 
> count 22, not 12 (excluding the leading spaces).

Sorry; that was a typo. Plus, I miscounted on top of the typo; I meant to count 
the spaces.

>> because you’re using Stephen’s string and it’s in NFKD, getting 14 and 
>> then indenting two spaces too many (as Python does today)
> 
> You mean something like this?
> 
> 
>    py> value = äë +* 42
>      File "<stdin>", line 1
>        value = äë +* 42
>                      ^
>    SyntaxError: invalid syntax
> 
> (the identifier is 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}')
> 
> Yes, that looks like a bug to me, but a super low priority one to fix.

Yes. This is a general bug: it shows up everywhere that you count code units 
intending to use that as a count of glyphs or characters, both in Python itself 
and in third-party libraries and in applications. This is one of the most trivial 
examples, and you obviously wouldn’t break backward compatibility with 
everything solely to fix this example.
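
As a hedged illustration of that bug, the sketch below builds the caret line 
two ways for the quoted example: by counting code points (which is what 
indenting by len() amounts to) and by counting only non-combining code points. 
caret_line is a hypothetical helper, not how CPython builds its error display, 
and the second variant is still only an approximation that ignores wide 
characters and emoji sequences.

```python
import unicodedata

def caret_line(source, offset, by_clusters):
    """Return a line of spaces ending in '^' that points at source[offset].

    offset is a code point index into source. With by_clusters=True the
    indent skips combining marks, which approximates the number of glyphs
    a monospace terminal draws before the error position.
    """
    prefix = source[:offset]
    if by_clusters:
        width = sum(1 for ch in prefix if unicodedata.combining(ch) == 0)
    else:
        width = len(prefix)
    return " " * width + "^"

# The identifier from the quoted example, in decomposed form.
line = "value = a\u0308e\u0308 +* 42"
offset = line.index("*")   # 14 code points precede the '*'

print(line)
print(caret_line(line, offset, by_clusters=False))  # 14 spaces: two too many
print(caret_line(line, offset, by_clusters=True))   # 12 spaces: under the '*'
```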

And I don’t know why I have to keep repeating this, but one more time: I’m not 
proposing to change Python, I’m arguing to _not_ change Python, because it’s 
already good enough, and the suggested improvement wouldn’t make it right 
despite breaking lots of code, and making it right is a big thing that would 
break even more code. If I were designing a new language, I would do it right 
from the start, and it would not have this bug, or any of the other 
manifestations of the same issue, but Python 4000 (or even 5000) is not an 
opportunity to design a new language.

(And to be clear: Python’s design made perfect sense when it was chosen; 
Unicode has just gotten more complicated since then. In fact, most other 
languages that adopted Unicode as early as Python got permanently stuck with 
the UCS-2 assumption, forcing all user code to deal with UTF-16 code units 
forever.)

> (This is assuming that the Python interpreter promises to line the caret 
> up with the offending symbol "always", rather than just making a best 
> effort to do so.)

Well, the reason I called it a good-enough best effort is because I assume that 
it’s only meant to be a best effort, and I think it’s good enough for that.

I’m not the one who said people would be up in arms if that were broken, I’m 
the one arguing that people are fine with it being broken as long as it’s 
usually good enough.

> And probably tough to fix too: I think you need to count in grapheme 
> clusters, not code points, 

Yes, that’s the whole point of the message you were responding to: extended 
grapheme clusters are the Unicode approximation of characters; code units are 
not. And a randomly-accessible sequence of grapheme clusters is impossible to 
do efficiently, but a sized iterable container, or a sequence-like thing that’s 
indexable by special indexes but not by integers, is. So tying the string type 
even more closely to code units would not fix it; changing the way it works as 
a Sequence would not fix it.
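
A minimal sketch of the "sized iterable container" half of that idea, under 
loud assumptions: ClusterView is a hypothetical name, and clusters are 
approximated as a base code point plus any following combining marks rather 
than by the full UAX #29 rules. It supports len(), iteration, and "in", but 
deliberately has no integer indexing.

```python
import unicodedata
from collections.abc import Collection

class ClusterView(Collection):
    """Sized and iterable, but deliberately not indexable by int.

    Assumption: a cluster is a base code point plus any combining marks
    that follow it; this is not full UAX #29 segmentation.
    """

    def __init__(self, text):
        self._text = text

    def __iter__(self):
        start = 0
        for i, ch in enumerate(self._text):
            # A non-combining code point starts a new cluster.
            if i > start and unicodedata.combining(ch) == 0:
                yield self._text[start:i]
                start = i
        if start < len(self._text):
            yield self._text[start:]

    def __len__(self):
        # O(n) scan: no constant-time random access is promised.
        return sum(1 for _ in self)

    def __contains__(self, cluster):
        return any(c == cluster for c in self)

view = ClusterView("a\u0308e\u0308 +* 42")
print(len(view))         # 8
print(list(view)[:2])    # the two decomposed letters, marks attached
```

Writing view[0] raises TypeError, since the class defines no __getitem__; 
positions would have to come from some opaque index type instead.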

> but even that won't fix the issue since it 
> leaves you open to the *opposite* problem of undercounting if the 
> terminal or IDE fails to display combining characters properly:
> 
>        value = a¨e¨ +* 42
>                    ^
>    SyntaxError: invalid syntax
> 
> I had to fake the above, because I couldn't find a terminal on my system 
> which would misdisplay COMBINING DIAERESIS, but I've seen editors do it.

That’s a matter of working around broken editors, terminals, and IDEs, which do 
exist, but are uncommon, and getting less common. Not having a workaround for a 
broken editor that most people don’t use is not a bug in the same way as being 
broken in a properly-working environment is.

(Not having a workaround for something broken that half the users in the world 
have to deal with, like Windows cmd.exe, would be a different story, of course. 
You can claim that it’s Windows’ bug, not yours, but that won’t make users 
happy. But I’m pretty sure that’s not an issue here.)

> Handling text in its full generality, including combining characters, 
> emojis, flags, East Asian wide character, etc, is really tough to do 
> right. For the Python interpreter, it would require a huge amount of 
> extra work for barely any payoff since 99.9% of Python syntax errors are 
> not going to include any of the funny cases.

Obviously you wouldn’t redesign the whole text API just to make syntax error 
carets line up. You would do that to make thousands of different things easier 
to write correctly, and lining up those carets is just one of those things, and 
nowhere near the most important one.

>>> Well, either that, or we need to make it so that " "*<AbstractIndex
>>> object at 0xb7ce1bf0> results in the correct number of spaces to
>>> indent it to that position. That ought to bring in plenty of
>>> pitchforks...
>> 
>> Would you still bring pitchforks for " " * StrIndex(chars=12, points=14, 
>> bytes=22)? 
> 
> Hell yes. If I need 12 spaces, why should I be forced to work out how 
> many bytes the interpreter uses for that?

If you know you need 12 spaces, you just multiply by 12; why do you think you 
need to work anything out? Adding str * StrIndex doesn’t require taking away 
str * int.

Your example implied that you would be working out that count in some way—say, 
by calling str.find—and that you and many others would be horrified if that 
return value were not an integer, but you could multiply it by a string anyway. 
I don’t know why you see anything wrong with that, but I guessed that maybe it 
was because you couldn’t see, at the REPL, how many spaces you were 
multiplying. Having the thing that’s returned by str.find have the repr 
StrIndex(chars=12, points=14, bytes=22) instead of the generic repr would 
solve that. If that isn’t your problem with being able to multiply a str by a 
StrIndex, then I have no other guesses for what you think people would be 
raising pitchforks over.
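
For concreteness, here is a rough sketch of what such an index could look 
like. Everything in it is hypothetical: StrIndex is not a real Python type, 
find here is a free function standing in for str.find, the "chars" field uses 
the same crude non-combining-mark approximation of clusters, and "bytes" is 
assumed to mean UTF-8 bytes.

```python
import unicodedata
from dataclasses import dataclass

@dataclass(frozen=True)
class StrIndex:
    """Hypothetical position in a string, counted three ways."""
    chars: int    # approximate grapheme clusters before the position
    points: int   # code points before the position
    bytes: int    # UTF-8 bytes before the position (assumed encoding)

    def __rmul__(self, other):
        # " " * index repeats by character count, which is what someone
        # lining up columns wants. str * int is left untouched.
        if isinstance(other, str):
            return other * self.chars
        return NotImplemented

def find(text, sub):
    """Like str.find, but returns a StrIndex (or None if not found)."""
    points = text.find(sub)
    if points < 0:
        return None
    prefix = text[:points]
    chars = sum(1 for ch in prefix if unicodedata.combining(ch) == 0)
    return StrIndex(chars, points, len(prefix.encode("utf-8")))

idx = find("value = a\u0308e\u0308 +* 42", "*")
print(idx)               # StrIndex(chars=12, points=14, bytes=16)
print(repr(" " * idx))   # a string of twelve spaces
print(repr(" " * 12))    # plain ints keep working as before
```

The dataclass-generated repr is what makes the value readable at the REPL.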

>> This is all simple stuff; I don’t get the incredulity 
>> that it could possibly be done. (Especially given that there are other 
>> languages that do exactly the same thing, like Swift, which ought to 
>> be proof that it’s not impossible.)
> 
> Can you link to an explanation of what Swift *actually* does, in detail?

The reference documentation for String starts at 
https://developer.apple.com/documentation/swift/string. (It should be the first 
thing that comes up in any search engine for Swift string.) You can follow the 
links from there to String.Index and String.Iterator, and from either of those 
to BidirectionalCollection, and from there to Collection, which explains how 
indexing works in general.

There’s probably an easier to understand description at 
https://docs.swift.org/swift-book but it may not explain *exactly* what it 
does, because it’s meant as a user guide.

Two things that may be confusing: Swift uses the exact same words as Python for 
its iteration/etc. protocols but all with different meanings (e.g., a Swift 
Sequence is a Python Iterable; a Python Sequence is a Swift 
IndexableCollection; etc.), and Swift makes heavy use of static typing (e.g., 
just as there are no separate display literals for Array, Set, etc., there are 
no separate display literals for Character and String; the literal "x" is a 
Character if you store it in a Character lvalue, a len-1 String if you store it 
in a String, and a TypeError if you store it in a Double).
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/HSEEJKV5XS5LGNS4JHD4GIPNXXMQYDVD/