On Sunday, July 16, 2017 at 2:55:57 AM UTC-5, Marko Rauhamaa wrote:
> Mikhail V <mikhail...@gmail.com>:
> > On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> > >
> > > Random access to code points is as uninteresting as
> > > random access to UTF-8 bytes. I might want random access
> > > to the "Grapheme clusters, a.k.a. real characters".
> >
> > What _real_ characters are you referring to? If your data
> > has "á" (U00E1), then it is one real character, if you
> > have "a" (U0061) and "ˊ" (U02CA) then it is _two_ real
> > characters. So in both cases you have access to code
> > points = real characters.
>
> It's true that confusion is caused by the ambiguity of the
> term "character."
>
> > For metaphysical discussion - in _my_ definition there is
> > no such "real" character as "á", since it is the "a" glyph
> > with some dirt, so according to my definition, it should
> > be two separate characters, both semantically and
> > technically seen.
>
> Here's the problem: when the human user types in "á" (with
> one, two or three keyclicks), they don't know how the
> computer represents it internally. The Unicode standard
> allows for two *equivalent* code point sequences
> (<URL: https://en.wikipedia.org/wiki/Unicode_equivalence>). When
> the computer outputs the sequence, the visible result is
> the single letter "á". The human user doesn't know, or
> care, about the internal representation.

*EXACTLY*. But your statement is far too general. Neither
the _human_user_ nor the _programmer_ need be concerned
with these low-level aspects of strings. The programmer
should only see strings from a practical standpoint:

    "Can I index the chars within them?"

    "Can I determine the length of them?"

    "Can I slice and dice and combine them?"

    "Can I trust that the character positions will maintain
    order?"

    "Can I, and my target users, display them in a human
    readable form using various rendering specifications
    defined by graphic designers (aka: font-o-philes)?"

If the answer to all of these questions is *YES*, then you
know all you need to know about strings. Now get to work!!!

> The user's expectation is that the visible letter "á"
> should behave like any other single letter. For example, a
> text editor should move the cursor past it with a single
> click of a left or right arrow key. Also, if I perform a
> regular-expression search in the editor and look for
>
>     Alv[aá]rez
>
> I should get a match with either Alvarez or Alvárez.

While what you say is relevant to _text_editors_ and
substring searching tools, you have wandered beyond the
topic we are discussing here, which is the practical
interface between a programmer and his/her strings. How a
text editor handles strings is irrelevant to a programmer,
unless of course we are writing custom text editor software
ourselves, in which case we can be the BDFL for a day, or
two. *wink*

> > And, in my definition, the whole Unicode is a huge
> > junkyard, to start with.
>
> I don't think anybody denies that. However, it's the best
> thing available and, more importantly, a universally
> accepted standard.
>
> > But opinions may vary, and in case you prefer or are
> > forced to write "á", then it can be impractical to store
> > it as two characters, regardless of encoding.
>
> Now I'm not following you.

Mikhail is referring to the claims made earlier in this
thread that accents are themselves distinct characters,
which I think is utter hooey. For instance, some folks here
would wish for len("á") to return 2. Does that seem
reasonable?

--
https://mail.python.org/mailman/listinfo/python-list
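
For anyone who wants to poke at the len("á") question in an interpreter, here is a minimal sketch using only the standard library. It assumes Python 3 str semantics (len counts code points). The helper names nfc and find_name and the sample string are made up for illustration; they do not come from anyone's post in this thread.

```python
import re
import unicodedata

# The same visible letter, stored as two canonically equivalent sequences.
composed = "\u00e1"      # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # U+0061 followed by U+0301 COMBINING ACUTE ACCENT


def nfc(text):
    """Return text in Normalization Form C (composed where possible)."""
    return unicodedata.normalize("NFC", text)


print(len(composed))                 # 1
print(len(decomposed))               # 2
print(composed == decomposed)        # False: str compares code points
print(nfc(decomposed) == composed)   # True: equal once both are normalized

# U+02CA (MODIFIER LETTER ACUTE ACCENT) is a spacing letter, not a combining
# mark, so "a" + U+02CA never composes into U+00E1.
print(len(nfc("a\u02ca")))           # 2


def find_name(pattern, text):
    """Hypothetical helper: normalize pattern and text to NFC, then search."""
    return re.findall(nfc(pattern), nfc(text))


# Three spellings: plain, composed accent, decomposed accent.
names = "Alvarez, Alv\u00e1rez, Alva\u0301rez"

# Without normalization the decomposed spelling is missed.
print(len(re.findall("Alv[a\u00e1]rez", names)))   # 2

# With normalization all three spellings match.
print(len(find_name("Alv[a\u00e1]rez", names)))    # 3
```

So whether len("á") gives 1 or 2 depends on which of the two equivalent sequences happens to be stored. Normalizing at the boundary (here to NFC) is one way to make indexing, length and regex matching behave consistently, though it is a design choice rather than something the language does for you.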