On Sunday, July 16, 2017 at 8:10:41 PM UTC+5:30, Rick Johnson wrote:
> On Sunday, July 16, 2017 at 2:55:57 AM UTC-5, Marko Rauhamaa wrote:
> > Mikhail V:
> > > On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> > > >
> > > > Random access to code points is as uninteresting as
> > > > random access to UTF-8 bytes. I might want random access
> > > > to the "Grapheme clusters, a.k.a. real characters".
> > >
> > > What _real_ characters are you referring to? If your data
> > > has "á" (U+00E1), then it is one real character, if you
> > > have "a" (U+0061) and "ˊ" (U+02CA) then it is _two_ real
> > > characters. So in both cases you have access to code
> > > points = real characters.
> >
> > It's true that confusion is caused by the ambiguity of the
> > term "character."
> >
> > > For metaphysical discussion - in _my_ definition there is
> > > no such "real" character as "á", since it is the "a" glyph
> > > with some dirt, so according to my definition, it should
> > > be two separate characters, both semantically and
> > > technically seen.
> >
> > Here's the problem: when the human user types in "á" (with
> > one, two or three keyclicks), they don't know how the
> > computer represents it internally. The Unicode standard
> > allows for two *equivalent* code point sequences (<URL:
> > https://en.wikipedia.org/wiki/Unicode_equivalence>). When
> > the computer outputs the sequence, the visible result is
> > the single letter "á". The human user doesn't know, or
> > care, about the internal representation.
>
> *EXACTLY*. But your statement is far too general. Not only
> need not the _human_user_ be concerned with these low level
> aspects of strings, but the _programmer_ need not be concerned
> either. The programmer should only see strings from a
> practical standpoint:
>
> "Can i index the chars within them?"
>
> "Can i determine the length of them?"
>
> "Can i slice and dice and combine them?"
>
> "Can i trust that the character positions will maintain
> order?"
>
> "Can i, and my target users, display them in a human
> readable form using various rendering specifications defined
> by graphic designers (aka: font-o-philes)?"
>
> If the answer to all of these questions is *YES*, then you
> know all you need to know about strings. Now get to work!!!
>
> > The user's expectation is that the visible letter "á"
> > should behave like any other single letter. For example, a
> > text editor should move the cursor past it with a single
> > click of a left or right arrow key. Also, if I perform a
> > regular-expression search in the editor and look for
> >
> >     Alv[aá]rez
> >
> > I should get a match with either Alvarez or Alvárez.
>
> While what you say is relevant to _text_editors_ and
> substring searching tools, you have wandered beyond the topic
> we are discussing here, which is practical interfacing
> between a programmer and his/her strings. How a text editor
> handles strings is irrelevant to a programmer. Unless of
> course we are writing a custom text editor software
> ourselves. In which case we can be the BDFL for a day, or
> two. *wink*
>
> > > And, in my definition, the whole Unicode is a huge
> > > junkyard, to start with.
> >
> > I don't think anybody denies that. However, it's the best
> > thing available and, more importantly, a universally accepted
> > standard.
> >
> > > But opinions may vary, and in case you prefer or are forced
> > > to write "á", then it can be impractical to store it as two
> > > characters, regardless of encoding.
> >
> > Now I'm not following you.
>
> Mikhail is referring to the claims made earlier in this
> thread that accents are themselves distinct characters.
> Which i think is utter hooey. For instance, some folks here
> would wish for len("á") to return 2. Does that seem
> reasonable?

$ python
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> len("á")
1
>>> len("á")
2

Shall we stipulate it to be 1.5? [¿ Maybe 1½ ?]
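
Since the two literals in the session above look identical once rendered, here is a minimal sketch that spells the code points out with escapes and uses the standard unicodedata module to convert between the two forms. The variable names are illustrative, and rough_char_count is a hypothetical helper: it only skips combining marks, so it is a crude approximation and not a real grapheme-cluster segmenter.

```python
import unicodedata

# Two canonically equivalent spellings of the same visible letter.
composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE (one code point)
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT (two code points)

# len() counts code points, so the two spellings disagree.
print(len(composed), len(decomposed))

# Normalizing to a single form makes len() and == consistent.
nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", composed)     # decompose fully
print(nfc == composed, nfd == decomposed)


def rough_char_count(s):
    """Hypothetical helper: count code points that are not combining marks.

    This is only an approximation of "user-perceived characters"; it does
    not implement the full Unicode text segmentation rules.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "Alvárez" written with the decomposed accent: 8 code points, 7 letters.
name = "Alva\u0301rez"
print(len(name), rough_char_count(name))
```

Normalizing input to one form (NFC is a common choice) before indexing, slicing, or comparing is one way to keep len() predictable; whether that is enough depends on the text being handled.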