On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
> On 2011-01-15 20:49:00 -0500, Jonathan M Davis <[email protected]> said:
> > On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
> >> I have my idea.
> >>
> >> I think it'd be a good idea is to improve upon Andrei's first idea --
> >> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> >> elements -- by changing the element type to be the same as the string.
> >> For instance, iterating on a char[] would give you slices of char[],
> >> each having one grapheme.
> >>
> >> The second component would be to make the string equality operator (==)
> >> for strings compare them in their normalized form, so that ("e" with
> >> combining acute accent) == (pre-combined "é"). I think this would make
> >> D support for Unicode much more intuitive.
> >>
> >> This implies some semantic changes, mainly that everywhere you write a
> >> "character" you must use double-quotes (string "a") instead of single
> >> quote (code point 'a'), but from the user's point of view that's pretty
> >> much all there is to change.
> >>
> >> There'll still be plenty of room for specialized algorithms, but their
> >> purpose would be limited to optimization. Correctness would be taken
> >> care of by the basic range interface, and foreach should follow suit
> >> and iterate by grapheme by default.
> >>
> >> I wrote this example (or something similar) earlier in this thread:
> >>
> >>     foreach (grapheme; "exposé")
> >>         if (grapheme == "é")
> >>             break;
> >>
> >> In this example, even if one of these two strings use the pre-combined
> >> form of "é" and the other uses a combining acute accent, the equality
> >> would still hold since foreach iterates on full graphemes and ==
> >> compares using normalization.
> >
> > I think that that would cause definite problems. Having the element
> > type of the range be the same type as the range seems like it could
> > cause a lot of problems in std.algorithm and the like, and it's
> > _definitely_ going to confuse programmers. I'd expect it to be highly
> > bug-prone. They _need_ to be separate types.
>
> I remember that someone already complained about this issue because he
> had a tree of ranges, and Andrei said he would take a look at this
> problem eventually. Perhaps now would be a good time.
>
> > Now, given that dchar can't actually work completely as an element
> > type, you'd either need the string type to be a new type or the element
> > type to be a new type. So, either the string type has char[], wchar[],
> > or dchar[] for its element type, or char[], wchar[], and dchar[] have
> > something like uchar as their element type, where uchar is a struct
> > which contains a char[], wchar[], or dchar[]
> > which holds a single grapheme.
>
> Having a new type for grapheme would work too. My preference still goes
> to reusing the string type because it makes the semantic simpler to
> understand, especially when comparing graphemes with literals.

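To make the two behaviours under discussion concrete, here is a minimal sketch of them done as plain library calls rather than as language semantics. It assumes a Phobos whose std.uni provides graphemeStride and normalize; the helper names graphemes and sameText are hypothetical, made up for this illustration, and are not part of any proposal in this thread.

```d
import std.uni : graphemeStride, normalize, NFC;

/// Hypothetical helper: split a string into slices of itself,
/// one slice per grapheme (the "element type == string type" idea).
string[] graphemes(string s)
{
    string[] result;
    for (size_t i = 0; i < s.length; )
    {
        // Length, in code units, of the grapheme starting at index i.
        immutable n = graphemeStride(s, i);
        result ~= s[i .. i + n];
        i += n;
    }
    return result;
}

/// Hypothetical helper: equality after normalizing both sides to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string decomposed = "expose\u0301"; // 'e' followed by combining acute
    string precomposedE = "\u00E9";     // single code point for e-acute

    auto parts = graphemes(decomposed);
    assert(parts.length == 6);                  // e, x, p, o, s, e+accent
    assert(parts[$ - 1] != precomposedE);       // plain == compares code units
    assert(sameText(parts[$ - 1], precomposedE)); // normalized comparison
}
```

Note that every element of parts here is itself a string, which is exactly the range-is-its-own-element-type situation being argued about; a dedicated grapheme type would wrap such a slice instead.
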
If a character literal actually became a grapheme instead of a dchar, then
that would likely solve that issue. But I fear that the semantics of having
a range be its own element type actually make understanding it _harder_,
not simpler. Being forced to compare a string literal against what should
be a character would definitely confuse programmers.

Making a new character or grapheme type which represented a grapheme would
be _far_ simpler to understand IMO. However, making it work really well
would likely require that the compiler know about the grapheme type like it
knows about dchar.

- Jonathan M Davis
