On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn <[email protected]> said:
> > Nick Sabalausky wrote:
> >> "Andrei Alexandrescu" <[email protected]> wrote in message
> >> news:[email protected]...
> >>
> >>> This may sometimes not be what the user expected; most of the time
> >>> they'd care about the code points.
> >>
> >> I dunno, spir has successfully convinced me that most of the time it's
> >> graphemes the user cares about, not code points. Using code points is
> >> just as misleading as using UTF-16 code units.
> >
> > I agree. This is a very informative thread, thanks spir and everybody
> > else.
> >
> > Going back to the topic, it seems to me that a unicode string is a
> > surprisingly complicated data structure that can be viewed from multiple
> > types of ranges. In the light of this thread, a dchar doesn't seem like
> > such a useful type anymore, it is still a low level abstraction for the
> > purpose of correctly dealing with text. Perhaps even less useful, since
> > it gives the illusion of correctness for those who are not in the know.
> >
> > The algorithms in std.string can be upgraded to work correctly with all
> > the issues mentioned, but the generic ones in std.algorithm will just
> > subtly do the wrong thing when presented with dchar ranges. And, as I
> > understood it, the purpose of a VleRange was exactly to make generic
> > algorithms just work (tm) for strings.
> >
> > Is it still possible to solve this problem or are we stuck with
> > specialized string algorithms? Would it work if VleRange of string was a
> > bidirectional range with string slices of graphemes as the ElementType
> > and indexing with code units? Often used string algorithms could be
> > specialized for performance, but if not, generic algorithms would still
> > work.
>
> I have my idea.
>
> I think it'd be a good idea to improve upon Andrei's first idea --
> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> elements -- by changing the element type to be the same as the string.
> For instance, iterating on a char[] would give you slices of char[],
> each having one grapheme.
>
> The second component would be to make the string equality operator (==)
> for strings compare them in their normalized form, so that ("e" with
> combining acute accent) == (pre-combined "é"). I think this would make
> D support for Unicode much more intuitive.
>
> This implies some semantic changes, mainly that everywhere you write a
> "character" you must use double-quotes (string "a") instead of single
> quote (code point 'a'), but from the user's point of view that's pretty
> much all there is to change.
>
> There'll still be plenty of room for specialized algorithms, but their
> purpose would be limited to optimization. Correctness would be taken
> care of by the basic range interface, and foreach should follow suit
> and iterate by grapheme by default.
>
> I wrote this example (or something similar) earlier in this thread:
>
> foreach (grapheme; "exposé")
>     if (grapheme == "é")
>         break;
>
> In this example, even if one of these two strings use the pre-combined
> form of "é" and the other uses a combining acute accent, the equality
> would still hold since foreach iterates on full graphemes and ==
> compares using normalization.
>
> The important thing to keep in mind here is that the grapheme-splitting
> algorithm should be optimized for the case where there is no combining
> character and the compare algorithm for the case where the string is
> already normalized, since most strings will exhibit these
> characteristics.
>
> As for ASCII, we could make it easier to use ubyte[] for it by making
> string literals implicitly convert to ubyte[] if all their characters
> are in ASCII range.

I think that that would cause definite problems. Having the element type
of the range be the same type as the range seems like it could cause a
lot of problems in std.algorithm and the like, and it's _definitely_
going to confuse programmers. I'd expect it to be highly bug-prone. They
_need_ to be separate types.

Now, given that dchar can't actually work completely as an element type,
you'd either need the string type to be a new type or the element type to
be a new type. So, either the string type has char[], wchar[], or dchar[]
for its element type, or char[], wchar[], and dchar[] have something like
uchar as their element type, where uchar is a struct which contains a
char[], wchar[], or dchar[] which holds a single grapheme.

I think that it's a great idea that programmers try to use substrings and
slices rather than dchar, but making the element type a slice of the
original type sounds like it's really asking for trouble.

- Jonathan M Davis
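
To make the "element type is a separate struct" option a bit more
concrete, here is a minimal sketch of it. GraphemeSlice and
ByGraphemeSlice are made-up names for illustration (they stand in for the
"uchar" idea above and are not anything in Phobos), and the sketch assumes
graphemeStride and normalize from std.uni as found in present-day Phobos.
Normalizing on every comparison is the simplest thing that works, not the
optimized compare discussed in the quoted message.

```d
import std.uni : graphemeStride, normalize, NFC;

// Hypothetical element type: a struct wrapping a slice that holds
// exactly one grapheme. It is a distinct type from the string itself.
struct GraphemeSlice
{
    string text; // slice of the original string, one grapheme long

    // Compare in normalized form (NFC), so "e" + combining acute
    // equals the precombined character.
    bool opEquals(const GraphemeSlice rhs) const
    {
        return normalize!NFC(text) == normalize!NFC(rhs.text);
    }
}

// Hypothetical input range over a string that yields GraphemeSlice
// elements rather than dchar or string.
struct ByGraphemeSlice
{
    string rest; // the part of the string not yet consumed

    @property bool empty() { return rest.length == 0; }

    @property GraphemeSlice front()
    {
        // graphemeStride gives the length, in code units, of the
        // grapheme starting at the given index.
        return GraphemeSlice(rest[0 .. graphemeStride(rest, 0)]);
    }

    void popFront()
    {
        rest = rest[graphemeStride(rest, 0) .. $];
    }
}

void main()
{
    // "exposé" written with 'e' followed by U+0301 (combining acute).
    auto word = ByGraphemeSlice("expose\u0301");
    // Precombined "é" (U+00E9).
    auto target = GraphemeSlice("\u00E9");

    size_t count;
    bool found;
    foreach (g; word)
    {
        ++count;
        if (g == target)
            found = true;
    }

    assert(count == 6); // six graphemes, although seven code points
    assert(found);      // equal after normalization
}
```

Since the element type is its own struct here, a generic algorithm can't
accidentally treat an element as if it were the whole string, which is the
confusion being worried about above.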
