On Saturday 15 January 2011 19:25:47 Jonathan M Davis wrote:
> On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
> > On 1/15/11 4:45 PM, Michel Fortin wrote:
> > > On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
> > > <[email protected]> said:
> > >> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
> > >> <[email protected]> wrote:
> > >>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
> > >>> <[email protected]> said:
> > >>>>> I'm not suggesting we impose it, just that we make it the default.
> > >>>>> If you want to iterate by dchar, wchar, or char, just write:
> > >>>>>
> > >>>>>     foreach (dchar c; "exposé") {}
> > >>>>>     foreach (wchar c; "exposé") {}
> > >>>>>     foreach (char c; "exposé") {}
> > >>>>>     // or
> > >>>>>     foreach (dchar c; "exposé".by!dchar()) {}
> > >>>>>     foreach (wchar c; "exposé".by!wchar()) {}
> > >>>>>     foreach (char c; "exposé".by!char()) {}
> > >>>>>
> > >>>>> and it'll work. But the default would be a slice containing the
> > >>>>> grapheme, because this is the right way to represent a Unicode
> > >>>>> character.
> > >>>>
> > >>>> I think this is a good idea. I previously was nervous about it, but
> > >>>> I'm not sure it makes a huge difference. Returning a char[] is
> > >>>> certainly less work than normalizing a grapheme into one or more
> > >>>> code points, and then returning them. All that it takes is to detect
> > >>>> all the code points within the grapheme. Normalization can be done
> > >>>> if needed, but would probably have to output another char[], since a
> > >>>> normalized grapheme can occupy more than one dchar.
> > >>>
> > >>> I'm glad we agree on that now.
> > >>
> > >> It's a matter of me slowly wrapping my brain around Unicode and how
> > >> it's used. It seems like it's a typical committee-defined standard
> > >> where there are 10 ways to do everything; I was trying to weed out the
> > >> lesser used (or so I perceived) pieces to allow a more implementable
> > >> library. It's doubly hard for me since I have limited experience with
> > >> other languages, and I've never tried to write them with a computer
> > >> (my language classes in high school were back in the days of actually
> > >> writing stuff down on paper).
> > >
> > > Actually, I don't think Unicode was so badly designed. It's just that
> > > nobody had an idea of the real scope of the problem they had in hand at
> > > first, and so they had to add a lot of things but wanted to keep things
> > > backward-compatible. We're at Unicode 6.0 now, can you name one other
> > > standard that evolved enough to get 6 major versions? I'm surprised
> > > it's not worse given all that it must support.
> > >
> > > That said, I'm sure if someone could redesign Unicode by breaking
> > > backward-compatibility we'd have something simpler. You could probably
> > > get rid of pre-combined characters and reduce the number of
> > > normalization forms. But would you be able to get rid of normalization
> > > entirely? I don't think so. Reinventing Unicode is probably not worth
> > > it.
> > >
> > >>> I'm not opposed to that on principle. I'm a little uneasy about
> > >>> having so many types representing a string however. Some other raw
> > >>> comments:
> > >>>
> > >>> I agree that things would be more coherent if char[], wchar[], and
> > >>> dchar[] behaved like other arrays, but I can't really see a
> > >>> justification for those types to be in the language if there's
> > >>> nothing special about them (why not a library type?).
> > >>
> > >> I would not be opposed to getting rid of those types. But I am very
> > >> opposed to char[] not being an array. If you want a string to be
> > >> something other than an array, make it have a different syntax. We
> > >> also have to consider C compatibility.
> > >>
> > >> However, we are in radical-change mode then, and this is probably
> > >> pushed to D3 ;) If we can find some way to fix the situation without
> > >> invalidating TDPL, we should strive for that first IMO.
> > >
> > > Indeed, the change would probably be too radical for D2.
> > >
> > > I think we agree that the default type should behave as a Unicode
> > > string, not an array of characters. I understand your opposition to
> > > conflating arrays of char with strings, and I agree with you to a
> > > certain extent that it could have been done better. But we can't really
> > > change the type of string literals, can we? The only thing we can
> > > change (I hope) at this point is how iterating on strings works.
> > >
> > > Walter said earlier that he opposes changing foreach's default element
> > > type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
> > > ground that it would silently break D1 compatibility. This is a valid
> > > point in my opinion.
> > >
> > > I think you're right when you say that not treating char[] as an array
> > > of characters breaks, to a certain extent, C compatibility. Another
> > > valid point.
> > >
> > > That said, I want to emphasize that iterating by grapheme, contrary to
> > > iterating by dchar, does not break any code *silently*. The compiler
> > > will complain loudly that you're comparing a string to a char, so
> > > you'll have to change your code somewhere if you want things to
> > > compile. You'll have to look at the code and decide what to do.
> > >
> > > One more thing:
> > >
> > > NSString in Cocoa is in essence the same thing as I'm proposing here:
> > > an array of UTF-16 code units, but with string behaviour. It supports
> > > by-code-unit indexing, but appending, comparing, searching for
> > > substrings, etc. all behave correctly as a Unicode string. Again, I
> > > agree that it's probably not the best design, but I can tell you it
> > > works well in practice. In fact, NSString doesn't even expose the
> > > concept of grapheme, it just uses them internally, and you're pretty
> > > much limited to the built-in operations. I think what we have here in
> > > concept is much better... even if it somewhat conflates code-unit
> > > arrays and strings.
> >
> > I'm unclear on where this is converging to. At this point the commitment
> > of the language and its standard library to (a) UTF array representation
> > and (b) code points conceptualization is quite strong. Changing that
> > would be quite difficult and disruptive, and the benefits are virtually
> > nonexistent for most of D's user base.
> >
> > It may be more realistic to consider using what we have as back-end for
> > grapheme-oriented processing. For example:
> >
> > struct Grapheme(Char) if (isSomeChar!Char)
> > {
> >     private const Char[] rep;
> >     ...
> > }
> >
> > auto byGrapheme(S)(S s) if (isSomeString!S)
> > {
> >     ...
> > }
> >
> > string s = "Hello";
> > foreach (g; byGrapheme(s))
> > {
> >     ...
> > }
>
> Considering that strings are already dealt with specially in order to have
> an element of dchar, I wouldn't think that it would be all that
> disruptive to make it so that they had an element type of Grapheme
> instead. Wouldn't that then fix all of std.algorithm and the like without
> really disrupting anything?
>
> The issue of foreach remains, but without being willing to change what
> foreach defaults to, you can't really fix it - though I'd suggest that we
> at least make it a warning to iterate over strings without specifying the
> type. And if foreach were made to understand Grapheme like it understands
> dchar, then you could do
>
> foreach(Grapheme g; str) { ... }
>
> and have the compiler warn about
>
> foreach(g; str) { ... }
>
> and tell you to use Grapheme if you want to be comparing actual characters.
>
> Regardless, by making strings ranges of Grapheme rather than dchar, I would
> think that we would solve most of the problem. At minimum, we'd have pretty
> much the same problems that we have right now with char and wchar arrays,
> but we'd get rid of a whole class of unicode problems. So, nothing would
> be worse, but some of it would be better.

I suppose that the one major omission, though, is that string comparisons would still be done by code unit rather than by grapheme, which would be a problem. == could be made to compare graphemes instead, but then you couldn't compare strings by code unit or code point unless you cast to ubyte[], ushort[], or uint[]... It would still probably be worth making == use graphemes though.

- Jonathan M Davis
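
To make the comparison issue concrete, here is a minimal sketch of how code-unit equality and grapheme-level equality can disagree for "exposé". It assumes a present-day Phobos in which std.uni provides byGrapheme and normalize (neither existed when this thread was written), and the helper names graphemeCount and canonicallyEqual are made up for illustration. It is not the design proposed in this thread, just one way to see the two notions of equality side by side:

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: number of grapheme clusters
/// (user-perceived characters) in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

/// Hypothetical helper: equality under canonical equivalence,
/// obtained by normalizing both sides to NFC before comparing.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string precomposed = "expos\u00E9";  // é as the single code point U+00E9
    string decomposed  = "expose\u0301"; // e followed by combining acute U+0301

    // The built-in == on string compares UTF-8 code units, so these differ.
    assert(precomposed != decomposed);
    assert(precomposed.length == 7); // 5 ASCII code units + 2 for U+00E9
    assert(decomposed.length == 8);  // 6 ASCII code units + 2 for U+0301

    // Both spell the same six user-perceived characters.
    assert(graphemeCount(precomposed) == 6);
    assert(graphemeCount(decomposed) == 6);

    // After normalization the two spellings compare equal.
    assert(canonicallyEqual(precomposed, decomposed));
}
```

Compiling with something like `rdmd -main -unittest` runs the checks. A grapheme-aware == would presumably have to do the equivalent of canonicallyEqual on every comparison, which is why keeping an explicit way to compare by code unit would still matter.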
