On Oct 27, 2019, at 18:00, Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas
> wrote:
>
>>>   File "/home/rosuav/tmp/demo.py", line 1
>>>     print("Hello, world!')
>>>                          ^
>>> SyntaxError: EOL while scanning string literal
>>
>> So if those 12 glyphs take 14 code units
>
> I'm not really sure how glyphs (the graphical representation of a
> character) comes into this discussion

Because, assuming you’re using a monospace font, the number of glyphs up
to the error is how many spaces you need to indent. This example happens
to be pure ASCII, so the counts of glyphs, extended grapheme clusters,
code units, and code points happen to be the same. But just change that
e to an è made of two code points (a base letter plus a combining mark),
like the ç in your previous example might have been, and now there are
still the same number of glyphs and clusters, but one more code point
and at least one more code unit.

Extended grapheme clusters are intended to be the best approximation of
“characters” in the Unicode standard. Code units are not.

> but for what it's worth, I
> count 22, not 12 (excluding the leading spaces).

Sorry; that was a typo. Plus, I miscounted on top of the typo; I meant
to count the spaces.

>> because you’re using Stephen’s string and it’s in NFKD, getting 14 and
>> then indenting two spaces too many (as Python does today)
>
> You mean something like this?
>
>
> py> value = äë +* 42
>   File "<stdin>", line 1
>     value = äë +* 42
>                   ^
> SyntaxError: invalid syntax
>
> (the identifier is 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}')
>
> Yes, that looks like a bug to me, but a super low priority one to fix.

Yes. It is a general bug: it shows up everywhere that you count code
units intending to use that as a count of glyphs or characters, in
Python itself, in third-party libraries, and in applications. This is
one of the most trivial examples, and you obviously wouldn’t break
backward compatibility with everything solely to fix this example.

And I don’t know why I have to keep repeating this, but one more time:
I’m not proposing to change Python, I’m arguing to _not_ change Python,
because it’s already good enough, and the suggested improvement wouldn’t
make it right despite breaking lots of code, and making it right is a
big thing that would break even more code.

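To put numbers on the äë example, here is a minimal sketch using only the
standard library. It counts the same text four ways, in both composed and
decomposed form. The helper names approx_clusters and counts are made up
for this illustration, and grouping a base character with the combining
marks that follow it is only a rough stand-in for extended grapheme
clusters; real segmentation follows the rules in UAX #29.

```python
import unicodedata


def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Assumption: this is a rough approximation of extended grapheme
    clusters. It ignores emoji sequences, Hangul jamo, and the other
    rules in UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters


def counts(text):
    """Return the length of text measured in four different units."""
    return {
        "clusters": len(approx_clusters(text)),
        "code_points": len(text),  # what len(str) reports in Python 3
        "utf8_units": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


composed = unicodedata.normalize("NFC", "a\u0308e\u0308")
decomposed = unicodedata.normalize("NFD", composed)

print(counts(composed))
print(counts(decomposed))

# Everything to the left of the offending "*" in the quoted example.
prefix = "value = " + decomposed + " +"
print(len(prefix), len(approx_clusters(prefix)))
```

The last line prints 14 and 12: indenting by the code point count puts
the caret two columns too far to the right, while the cluster count
matches what a monospace display shows.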
If I were designing a new language, I would do it right from the start,
and it would not have this bug, or any of the other manifestations of
the same issue, but Python 4000 (or even 5000) is not an opportunity to
design a new language. (And to be clear: Python’s design made perfect
sense when it was chosen; Unicode has just gotten more complicated since
then. In fact, most other languages that adopted Unicode as early as
Python got permanently stuck with the UCS-2 assumption, forcing all user
code to deal with UTF-16 code units forever.)

> (This is assuming that the Python interpreter promises to line the caret
> up with the offending symbol "always", rather than just making a best
> effort to do so.)

Well, the reason I called it a good-enough best effort is because I
assume that it’s only meant to be a best effort, and I think it’s good
enough for that. I’m not the one who said people would be up in arms if
that were broken, I’m the one arguing that people are fine with it being
broken as long as it’s usually good enough.

> And probably tough to fix too: I think you need to count in grapheme
> clusters, not code points,

Yes, that’s the whole point of the message you were responding to:
extended grapheme clusters are the Unicode approximation of characters;
code units are not. And a randomly-accessible sequence of grapheme
clusters is impossible to do efficiently, but a sized iterable
container, or a sequence-like thing that’s indexable by special indexes
but not by integers, is. So tying the string type even more closely to
code units would not fix it; changing the way it works as a Sequence
would not fix it.

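As a rough sketch of what "sized, iterable, indexable by special indexes
but not by integers" could look like in Python: nothing below exists in
Python today. ClusterView and StrIndex are hypothetical names, the field
layout of StrIndex is an assumption, and the cluster boundaries are
approximated by treating combining marks as continuations, which is much
cruder than real UAX #29 segmentation.

```python
import bisect
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class StrIndex:
    """Hypothetical opaque index: one position measured three ways."""
    chars: int   # approximate grapheme clusters before the position
    points: int  # code points before the position
    bytes: int   # UTF-8 code units before the position

    def __rmul__(self, other):
        # " " * index repeats by the cluster count, which is the number
        # of monospace columns if no wide characters are involved.
        if isinstance(other, str):
            return other * self.chars
        return NotImplemented


class ClusterView:
    """Sized iterable over approximate grapheme clusters of a str."""

    def __init__(self, text):
        self._text = text
        # Code point offsets where a cluster starts (assumption: any
        # character that is not a combining mark starts a cluster).
        self._starts = [
            i for i, ch in enumerate(text)
            if i == 0 or not unicodedata.combining(ch)
        ]

    def __len__(self):
        return len(self._starts)

    def __iter__(self):
        ends = self._starts[1:] + [len(self._text)]
        for start, end in zip(self._starts, ends):
            yield self._text[start:end]

    def __getitem__(self, index):
        # Deliberately not indexable by int: that would need O(n) work
        # or a side table for every lookup.
        if not isinstance(index, StrIndex):
            raise TypeError("index with a StrIndex, not "
                            + type(index).__name__)
        slot = bisect.bisect_right(self._starts, index.points)
        end = (self._starts[slot] if slot < len(self._starts)
               else len(self._text))
        return self._text[index.points:end]

    def find(self, sub):
        """Like str.find, but return a StrIndex, or None if absent."""
        point = self._text.find(sub)
        if point < 0:
            return None
        prefix = self._text[:point]
        return StrIndex(chars=len(ClusterView(prefix)), points=point,
                        bytes=len(prefix.encode("utf-8")))


line = "value = a\u0308e\u0308 +* 42"
view = ClusterView(line)
idx = view.find("*")
print(idx)              # StrIndex(chars=12, points=14, bytes=16)
print(line)
print(" " * idx + "^")  # caret indented by 12 columns, not 14
```

The point of the design is that find() does the counting once, while it
is already walking the string, and the caller never has to convert
between units by hand.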
> but even that won't fix the issue since it
> leaves you open to the *opposite* problem of undercounting if the
> terminal or IDE fails to display combining characters properly:
>
> value = a¨e¨ +* 42
>             ^
> SyntaxError: invalid syntax
>
> I had to fake the above, because I couldn't find a terminal on my system
> which would misdisplay COMBINING DIAERESIS, but I've seen editors do it.

That’s a matter of working around broken editors, terminals, and IDEs,
which do exist, but are uncommon, and getting less common. Not having a
workaround for a broken editor that most people don’t use is not a bug
in the same way as being broken in a properly-working environment is.
(Not having a workaround for something broken that half the users in the
world have to deal with, like Windows cmd.exe, would be a different
story, of course. You can claim that it’s Windows’ bug, not yours, but
that won’t make users happy. But I’m pretty sure that’s not an issue
here.)

> Handling text in its full generality, including combining characters,
> emojis, flags, East Asian wide character, etc, is really tough to do
> right. For the Python interpreter, it would require a huge amount of
> extra work for barely any payoff since 99.9% of Python syntax errors are
> not going to include any of the funny cases.

Obviously you wouldn’t redesign the whole text API just to make syntax
error carets line up. You would do that to make thousands of different
things easier to write correctly, and lining up those carets is just one
of those things, and nowhere near the most important one.

>>> Well, either that, or we need to make it so that " "*<AbstractIndex
>>> object at 0xb7ce1bf0> results in the correct number of spaces to
>>> indent it to that position. That ought to bring in plenty of
>>> pitchforks...
>>
>> Would you still bring pitchforks for " " * StrIndex(chars=12, points=14,
>> bytes=22)?
>
> Hell yes.
> If I need 12 spaces, why should I be forced to work out how
> many bytes the interpreter uses for that?

If you know you need 12 spaces, you just multiply by 12; why do you
think you need to work anything out? Adding str * StrIndex doesn’t
require taking away str * int.

Your example implied that you would be working out that count in some
way (say, by calling str.find), and that you and many others would be
horrified if that return value were not an integer but you could
multiply it by a string anyway. I don’t know why you see anything wrong
with that, but I guessed that maybe it was because you couldn’t see, at
the REPL, how many spaces you were multiplying. Having the thing that’s
returned by str.find have the repr StrIndex(chars=12, points=14,
bytes=22) instead of the generic repr would solve that.

If that isn’t your problem with being able to multiply a str by a
StrIndex, then I have no other guesses for what you think people would
be raising pitchforks over.

>> This is all simple stuff; I don’t get the incredulity
>> that it could possibly be done. (Especially given that there are other
>> languages that do exactly the same thing, like Swift, which ought to
>> be proof that it’s not impossible.)
>
> Can you link to an explanation of what Swift *actually* does, in detail?

The reference documentation for String starts at
https://developer.apple.com/documentation/swift/string. (It should be
the first thing that comes up in any search engine for Swift string.)
You can follow the links from there to String.Index and String.Iterator,
and from either of those to BidirectionalCollection, and from there to
Collection, which explains how indexing works in general.

There’s probably an easier-to-understand description at
https://docs.swift.org/swift-book but it may not explain *exactly* what
it does, because it’s meant as a user guide.

Two things that may be confusing: Swift uses the exact same words as
Python for its iteration/etc.
protocols but all with different meanings (e.g., a Swift Sequence is a
Python Iterable; a Python Sequence is a Swift IndexableCollection;
etc.), and Swift makes heavy use of static typing (e.g., just as there
are no separate display literals for Array, Set, etc., there are no
separate display literals for Character and String; the literal "x" is a
Character if you store it in a Character lvalue, a len-1 String if you
store it in a String, and a TypeError if you store it in a Double).
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/HSEEJKV5XS5LGNS4JHD4GIPNXXMQYDVD/
Code of Conduct: http://python.org/psf/codeofconduct/