Collin Winter writes:

> Sincere question: if these characters aren't needed, why are they
> provided? From what I can tell by googling, they're needed when, e.g.,
> Arabic is embedded in an otherwise left-to-right script. Do I have
> that right? That sounds pretty close to what you'd get when using
> Arabic identifiers with the English keywords/stdlib.

The problem is visual presentation to humans. It's very much like unmarshalling little-endian integers from a byte stream. The byte stream by definition is big-endian, so when you simply memcpy into the stream buffer, little-endian integers will come out in reverse byte order. Bidi works a little bit differently; in principle it works both ways (if you start LTR then the RTL is in reverse order in the stream, and vice versa) since both kinds of script are character streams. But in both cases, *inside* the computer, there is a natural "big-endian" order and the computer does not get confused. That is one sense in which format characters are YAGNIs.

Now, identifiers are by definition character streams. If an English speaker would pronounce the spelling of an English word "A B C", and an Arabic speaker an Arabic word as "1 2 3", then *as an identifier* the combination English then Arabic is spelled "A B C _ 1 2 3". And that's all the Python compiler needs to know.

In fact, on the editor display this would be presented as "ABC_321". In data entry, you'd see something like this:

    key    display
    A      A
    B      AB
    C      ABC
    _      ABC_
    1      ABC_1
    2      ABC_21
    3      ABC_321

This can be done algorithmically (this is the "Unicode Standard Annex #9", aka "UAX #9", that you may have seen references to), to a very high degree of approximation to what human typesetters do in bidi cultures.

Now suppose you want to see on screen the contents of memory cells as characters. Then you would put into memory something like "A B C _ LRO 1 2 3", where LRO is a control character that says "no matter what directional property a character normally has, override that with left-to-right until I say otherwise." That logical sequence of characters is indeed displayed as "ABC_123".

But how about those as identifiers? Note that in memory the sequence of printing characters is "A B C _ 1 2 3" in each case. So it makes sense to think of that as the identifier, *ignoring* the presentation control characters.

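To make the logical-vs-visual distinction concrete, here is a toy sketch in Python. It is emphatically not UAX #9: it only handles a left-to-right line, it treats the digits "1 2 3" as stand-ins for Arabic letters (following the notation in this message), and it lets LRO stay in force to the end of the string. The names toy_display, identifier_chars and RTL_STANDINS are made up for illustration.

```python
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE

# Assumption for illustration only: the digits play the role of
# right-to-left (Arabic) letters, as in the "A B C _ 1 2 3" notation.
RTL_STANDINS = set("123")


def toy_display(logical):
    """Return a toy visual ordering of a logically ordered string.

    Runs of right-to-left stand-ins are reversed; everything else is
    emitted in logical order. This is a sketch, not the bidi algorithm.
    """
    out = []          # characters already placed in visual order
    rtl_run = []      # pending right-to-left run, in logical order
    override = False  # True once LRO has been seen
    for ch in logical:
        if ch == LRO:
            override = True  # force left-to-right from here on
            continue         # the format character itself has no glyph
        if ch in RTL_STANDINS and not override:
            rtl_run.append(ch)
        else:
            out.extend(reversed(rtl_run))  # flush the run, reversed
            rtl_run.clear()
            out.append(ch)
    out.extend(reversed(rtl_run))
    return "".join(out)


def identifier_chars(logical):
    """Return the printing characters only, i.e. the string with LRO dropped."""
    return logical.replace(LRO, "")


if __name__ == "__main__":
    typed = "ABC_123"
    # Reproduce the key/display table: show the display after each keystroke.
    for n in range(1, len(typed) + 1):
        print(typed[n - 1], toy_display(typed[:n]))
```

With the override in place, toy_display("ABC_" + LRO + "123") comes out in memory order, while identifier_chars gives the same "ABC_123" sequence for both the plain and the overridden spelling, which is the sense in which the control character is presentation only.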
Suppose we prohibit the directional control characters. Then a Unicode-conforming editor will put the characters in logical order "A B C _ 1 2 3" in the file, and display them naturally (to a speaker of Arabic) as "ABC_321". This is going to be by far the most common case, and the user knows that it works this way. I don't see a problem here. Do you?

OK, now let's consider the cases of breakage. Consider a malicious author who uses LRO as in "A B C _ 1 2 LRO 3", which displays as "ABC_213" (IIRC; I haven't actually tried to implement bidi in a very long time). Can you think of a genuine use for that? I can't; I think it's a bad idea to allow it.

On the other hand, you could have a situation where the printed documentation uses the UAX #9 bidi algorithm, and discusses the meaning of the identifier "ABC_321", while the reviewing programmer is using a broken editor which implements overrides but not the algorithm, and sees "ABC_123". So in the case where LRO is permitted, the author can enforce a single visual order that the reviewer will see in both the documents and the editor display. But since it's the unnatural (to an Arabic reader) "ABC_123", it will be confusing and hard to read. Is this a win?

As somebody (I think Jim J) pointed out, bidi is a world of pain unless and until *all* editors and readers implement a common set of display conventions. Python can't do anything that will unambiguously reduce that pain. So IMHO it is best to conform to a standard that can be unambiguously implemented, and is likely to be available to the majority of programmers who need to work in bidi environments. That is UAX #31, which mandates ignoring these format characters (in the default profile), and strongly recommends prohibiting them in all profiles.

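For what "prohibit" might look like in practice, here is a minimal sketch of a checker that reports directional control characters in a candidate identifier. The function name find_bidi_controls and the exact set of code points are my assumptions for illustration (the explicit embeddings and overrides plus the two directional marks); this is not the UAX #31 definition and not anything Python itself implements this way.

```python
import unicodedata

# Assumed set of directional controls to reject (illustrative only):
# LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_CONTROLS = frozenset("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def find_bidi_controls(identifier):
    """Return (index, character name) for each directional control found."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(identifier)
        if ch in BIDI_CONTROLS
    ]


if __name__ == "__main__":
    # "A B C _ 1 2 LRO 3", with real Arabic letters standing for 1 2 3.
    suspicious = "ABC_\u0627\u0628\u202d\u062c"
    clean = "ABC_\u0627\u0628\u062c"
    print(find_bidi_controls(suspicious))
    print(find_bidi_controls(clean))
```

A tool that refuses any identifier for which this list is non-empty accepts the common case (plain logical order) and rejects the "ABC_213" trick, without having to implement any display logic at all.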
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000