On Fri, Jul 14, 2017 at 6:53 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> Chris Angelico <ros...@gmail.com>:
>
>> On Fri, Jul 14, 2017 at 6:15 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>>> Furthermore, you only dismissed my question about
>>>
>>>    len(text)
>>>
>>> What about
>>>
>>>    text[-1]
>>>    re.match("a.c", text)
>>
>> The considerations and concerns in the second half of my paragraph -
>> the bit you didn't quote - directly address these two.
>
> I guess you refer to:
>
>    These kinds of linguistic considerations shouldn't be codified into
>    the core of the language.

No, I don't. I refer to the second half of the paragraph you quoted
the first half of.

> Then, why bother with Unicode to begin with? Why not just use bytes?
> After all, Python3's strings have the very same pitfalls:
>
>  - you don't know the length of a text in characters
>
>  - chr(n) doesn't return a character
>
>  - you can't easily find the 7th character in a piece of text

First you have to define "character". There are enough different
definitions of "character" (for the purposes of
counting/iteration/subscripting) that at least some of them have to be
separate functions or methods.

>  - you can't compare the equality of two pieces of text
>
>  - you can't use a piece of text as a reliable dict key

(Dict key usage is defined in terms of equality, so these two are the
same concern.)

Yes, you can. For most purposes, textual equality should be defined in
terms of NFC or NFD normalization. Python already gives you that. You
could argue that a string should always be stored NFC (or NFD, take
your pick), and then the equality operator would handle this; but I'm
not sure the benefit is worth it. And you can't define equality by
whether two strings would display identically, because then you lose
semantic information (for instance, the difference between U+0020 and
U+00A0, or between U+2004 and a pair of U+2006, or between U+004B and
U+041A), not to mention the way that some fonts introduce confusing
similarities that other fonts don't.

If you're trying to use strings as identifiers in any way (say, file
names, or document lookup references), using the NFC/NFD normalized
form of the string should be sufficient.

ChrisA
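
To make the normalization point concrete, here is a minimal sketch
using the standard library's unicodedata.normalize(). The helper name
nfc_key and the sample strings are made up for this illustration; they
are not from the thread. It assumes NFC is the form you settle on (NFD
would work the same way as long as you are consistent).

```python
import unicodedata


def nfc_key(text: str) -> str:
    """Hypothetical helper: return text in NFC, for comparison or dict keys."""
    return unicodedata.normalize("NFC", text)


# Two spellings of the same word, differing only in how "e acute" is encoded.
composed = "caf\u00e9"      # U+00E9 as a single code point
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Code point counts differ, which is why "character" needs a definition first.
lengths = (len(composed), len(decomposed))

# Plain == compares code points, so the raw strings are unequal...
raw_equal = composed == decomposed
# ...but they compare equal once both sides are normalized.
normalized_equal = nfc_key(composed) == nfc_key(decomposed)

# The same idea makes a string usable as a reliable dict key.
lookup = {nfc_key(composed): "found"}
hit = lookup.get(nfc_key(decomposed))

# Normalization keeps semantically distinct look-alikes distinct.
space_vs_nbsp = nfc_key("\u0020") == nfc_key("\u00a0")      # SPACE vs NO-BREAK SPACE
latin_vs_cyrillic = nfc_key("\u004b") == nfc_key("\u041a")  # Latin K vs Cyrillic Ka

print(lengths, raw_equal, normalized_equal, hit)
print(space_vs_nbsp, latin_vs_cyrillic)
```

Normalizing at the boundary (when a string becomes a key or is
compared) keeps the stored text untouched, which is one way to get the
behaviour without forcing every str to be stored normalized.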