Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin Sat, 15 Jan 2011 14:50:57 -0800

On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" said:

On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin wrote:
On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" said:
I'm not suggesting we impose it, just that we make it the default. If you want to iterate by dchar, wchar, or char, just write:
        foreach (dchar c; "exposé") {}
        foreach (wchar c; "exposé") {}
        foreach (char c; "exposé") {}
        // or
        foreach (dchar c; "exposé".by!dchar()) {}
        foreach (wchar c; "exposé".by!wchar()) {}
        foreach (char c; "exposé".by!char()) {}
and it'll work. But the default would be a slice containing the grapheme, because this is the right way to represent a Unicode character.
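For reference, the library ranges in current Phobos come close to the `by!T` idea above. This is only a minimal sketch, assuming `std.utf.byCodeUnit`, `byWchar`, `byDchar` and `std.uni.byGrapheme`, all of which postdate this thread. `graphemeCount` is a hypothetical helper name, and `byGrapheme` yields `Grapheme` values rather than the string slices proposed here:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar, byWchar;

/// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // "exposé" spelled with a plain 'e' followed by U+0301 (combining acute).
    string s = "expose\u0301";

    assert(s.byCodeUnit.walkLength == 8); // UTF-8 code units
    assert(s.byWchar.walkLength == 7);    // UTF-16 code units
    assert(s.byDchar.walkLength == 7);    // code points
    assert(graphemeCount(s) == 6);        // graphemes
}
```

Only the grapheme count matches what a reader would call the length of the word.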
I think this is a good idea. I previously was nervous about it, but I'm not sure it makes a huge difference. Returning a char[] is certainly less work than normalizing a grapheme into one or more code points, and then returning them. All that it takes is to detect all the code points within the grapheme. Normalization can be done if needed, but would probably have to output another char[], since a normalized grapheme can occupy more than one dchar.
I'm glad we agree on that now.
It's a matter of me slowly wrapping my brain around Unicode and how it's used. It seems like it's a typical committee-defined standard where there are 10 ways to do everything, I was trying to weed out the lesser used (or so I perceived) pieces to allow a more implementable library. It's doubly hard for me since I have limited experience with other languages, and I've never tried to write them with a computer (my language classes in high school were back in the days of actually writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that nobody had an idea of the real scope of the problem they had in hand at first, and so they had to add a lot of things but wanted to keep things backward-compatible. We're at Unicode 6.0 now; can you name one other standard that evolved enough to get 6 major versions? I'm surprised it's not worse given all that it must support.

That said, I'm sure if someone could redesign Unicode by breaking backward-compatibility we'd have something simpler. You could probably get rid of pre-combined characters and reduce the number of normalization forms. But would you be able to get rid of normalization entirely? I don't think so. Reinventing Unicode is probably not worth it.

I'm not opposed to that on principle. I'm a little uneasy about having so many types representing a string however. Some other raw comments:
I agree that things would be more coherent if char[], wchar[], and dchar[] behaved like other arrays, but I can't really see a justification for those types to be in the language if there's nothing special about them (why not a library type?).
I would not be opposed to getting rid of those types. But I am very opposed to char[] not being an array. If you want a string to be something other than an array, make it have a different syntax. We also have to consider C compatibility.
However, we are in radical-change mode then, and this is probably pushed to D3 ;) If we can find some way to fix the situation without invalidating TDPL, we should strive for that first IMO.


Indeed, the change would probably be too radical for D2.

I think we agree that the default type should behave as a Unicode string, not an array of characters. I understand your opposition to conflating arrays of char with strings, and I agree with you to a certain extent that it could have been done better. But we can't really change the type of string literals, can we? The only thing we can change (I hope) at this point is how iterating on strings works.

Walter said earlier that he opposes changing foreach's default element type to dchar for char[] and wchar[] (as Andrei did for ranges) on the ground that it would silently break D1 compatibility. This is a valid point in my opinion.

I think you're right when you say that not treating char[] as an array of characters breaks, to a certain extent, C compatibility. Another valid point.

That said, I want to emphasize that iterating by grapheme, contrary to iterating by dchar, does not break any code *silently*. The compiler will complain loudly that you're comparing a string to a char, so you'll have to change your code somewhere if you want things to compile. You'll have to look at the code and decide what to do.
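To make the "loudly" part concrete, here is a small sketch. It assumes graphemes are handed out as plain string slices, as proposed in this thread; `isPlainE` is a hypothetical helper used only for illustration:

```d
// Assumption: each grapheme is a string slice, as proposed above.
// isPlainE is a hypothetical helper, not an existing library function.
bool isPlainE(string grapheme)
{
    // Comparing against a string literal compiles and works.
    return grapheme == "e";
}

// Old code that compares the element to a char literal is rejected at
// compile time: string and char are incompatible operands for ==.
static assert(!__traits(compiles, "e" == 'e'));
static assert( __traits(compiles, "e" == "e"));

void main()
{
    assert(isPlainE("e"));
    assert(!isPlainE("e\u0301")); // 'e' + combining acute is another grapheme
}
```

Code written for char elements fails to build instead of quietly changing meaning, which is the difference from switching the default to dchar.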


One more thing:

NSString in Cocoa is in essence the same thing as I'm proposing here: an array of UTF-16 code units, but with string behaviour. It supports by-code-unit indexing, but appending, comparing, searching for substrings, etc. all behave correctly as a Unicode string. Again, I agree that it's probably not the best design, but I can tell you it works well in practice. In fact, NSString doesn't even expose the concept of grapheme, it just uses them internally, and you're pretty much limited to the built-in operations. I think what we have here in concept is much better... even if it somewhat conflates code-unit arrays and strings.

Or you could make a grapheme a string_t. ;-)
I'm a little uneasy having a range return itself as its element type. For all intents and purposes, a grapheme is a string of one 'element', so it could potentially be a string_t.
It does seem daunting to have so many types, but at the same time, types convey relationships at compile time that can make coding impossible to get wrong, or make things actually possible when having a single type doesn't.
I'll give you an example from a previous life:

[...]
I feel that making extra types when the relationship between them is important is worth the possible repetition of functionality. Catching bugs during compilation is soooo much better than experiencing them during runtime.

I can understand the utility of a separate type in your DateTime example, but in this case I fail to see any advantage.

I mean, a grapheme is a slice of a string, can have multiple code points (like a string), can be appended the same way as a string, can be composed or decomposed using canonical normalization or compatibility normalization (like a string), and should be sorted, uppercased, and lowercased according to Unicode rules (like a string). Basically, a grapheme is just a string that happens to contain only one grapheme. What would a custom type do differently than a string?
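As a rough illustration of "a grapheme is a slice of a string", here is a sketch that assumes `std.uni.graphemeStride` from current Phobos (it did not exist when this was written). `firstGrapheme` is a hypothetical helper name:

```d
import std.uni : graphemeStride;

/// Hypothetical helper: the first grapheme of s, returned as a slice of s.
string firstGrapheme(string s)
{
    // graphemeStride gives the length, in code units, of the grapheme
    // cluster starting at the given index.
    return s.length ? s[0 .. graphemeStride(s, 0)] : s;
}

void main()
{
    // 'e' + combining acute (U+0301), then 't', then precomposed U+00E9.
    string s = "e\u0301t\u00E9";
    string g = firstGrapheme(s);

    assert(g == "e\u0301");  // compares like any other string
    assert(g.length == 3);   // 1 code unit for 'e' + 2 for U+0301
    assert(g.ptr is s.ptr);  // a slice of the original, no copy made
    assert(g ~ "t" == "e\u0301t"); // appends like any other string
}
```

Nothing here needs a dedicated grapheme type: comparison, slicing, and appending are the ordinary string operations.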

Also, grapheme == "a" is easy to understand because both are strings. But if a grapheme is a separate type, what would a grapheme literal look like?

So in the end I don't think a grapheme needs a specific type, at least not for general purpose text processing. If I split a string on whitespace, do I get a range where elements are of type "word"? No, just sliced strings.

That said, I'm much less concerned by the type used to represent a grapheme than by the Unicode correctness. I'm not opposed to a separate type, I just don't really see the point.


--
Michel Fortin
http://michelf.com/
