Re: VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin Sat, 15 Jan 2011 21:25:19 -0800

On 2011-01-15 23:58:30 -0500, Jonathan M Davis <[email protected]> said:

On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:

On 2011-01-15 20:49:00 -0500, Jonathan M Davis <[email protected]> said:

On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:

I have my idea.

I think it'd be a good idea to improve upon Andrei's first idea --
which was to treat char[], wchar[], and dchar[] all as ranges of dchar
elements -- by changing the element type to be the same as the string.
For instance, iterating on a char[] would give you slices of char[],
each having one grapheme.
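
To make the idea concrete, here is a minimal sketch of a range over a string whose elements are slices of that same string, one grapheme each. GraphemeSlices is a hypothetical name, not something in Phobos, and graphemeStride comes from std.uni in more recent Phobos releases, so treat this as an illustration under those assumptions rather than the proposed implementation.

```d
import std.uni : graphemeStride;

/// Hypothetical helper (not part of Phobos): an input range whose
/// elements are slices of the original string, one grapheme each.
struct GraphemeSlices
{
    string rest;

    bool empty() { return rest.length == 0; }

    string front()
    {
        // graphemeStride gives the number of code units occupied by
        // the grapheme that starts at the given index.
        return rest[0 .. graphemeStride(rest, 0)];
    }

    void popFront()
    {
        rest = rest[graphemeStride(rest, 0) .. $];
    }
}

void main()
{
    // "expose" followed by U+0301 COMBINING ACUTE ACCENT
    string s = "expose\u0301";

    size_t count;
    string last;
    foreach (g; GraphemeSlices(s))
    {
        ++count;
        last = g;
    }

    assert(count == 6);          // six graphemes
    assert(last == "e\u0301");   // the last slice keeps the accent attached
}
```

Note that the element type here is string, the same type as the range's source, which is exactly the property debated later in this thread.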

The second component would be to make the string equality operator (==)
compare strings in their normalized form, so that ("e" with combining
acute accent) == (pre-combined "é"). I think this would make D's
support for Unicode much more intuitive.
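
As a rough sketch of what normalized comparison means in practice, the helper below normalizes both operands to NFC before comparing. normalizedEqual is a hypothetical name; normalize and NFC are from std.uni in more recent Phobos releases. The built-in == is left untouched here, so this only approximates the behaviour being proposed.

```d
import std.uni;

/// Hypothetical helper: compare two strings after normalizing both to NFC.
bool normalizedEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string decomposed  = "e\u0301";  // "e" + U+0301 COMBINING ACUTE ACCENT
    string precomposed = "\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE

    // The built-in == compares code units, so the two forms differ.
    assert(decomposed != precomposed);

    // After normalization they compare equal.
    assert(normalizedEqual(decomposed, precomposed));
}
```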

This implies some semantic changes, mainly that everywhere you write a
"character" you must use double-quotes (string "a") instead of single
quote (code point 'a'), but from the user's point of view that's pretty
much all there is to change.

There'll still be plenty of room for specialized algorithms, but their
purpose would be limited to optimization. Correctness would be taken
care of by the basic range interface, and foreach should follow suit
and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:
        foreach (grapheme; "exposé")
                if (grapheme == "é")
                        break;

In this example, even if one of these two strings uses the pre-combined
form of "é" and the other uses a combining acute accent, the equality
would still hold since foreach iterates on full graphemes and
compares using normalization.


I think that that would cause definite problems. Having the element
type of the range be the same type as the range seems like it could
cause a lot of problems in std.algorithm and the like, and it's
_definitely_ going to confuse programmers. I'd expect it to be highly
bug-prone. They _need_ to be separate types.


I remember that someone already complained about this issue because he
had a tree of ranges, and Andrei said he would take a look at this
problem eventually. Perhaps now would be a good time.

Now, given that dchar can't actually work completely as an element
type, you'd either need the string type to be a new type or the element
type to be a new type. So, either the string type has char[], wchar[],
or dchar[] for its element type, or char[], wchar[], and dchar[] have
something like uchar as their element type, where uchar is a struct
which contains a char[], wchar[], or dchar[] holding a single grapheme.
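
The claim that dchar cannot fully work as an element type can be illustrated by counting the same string three ways. This sketch assumes byGrapheme from std.uni in more recent Phobos releases; the sample string is only an example.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "expose" followed by U+0301 COMBINING ACUTE ACCENT
    string s = "expose\u0301";

    assert(s.length == 8);                 // UTF-8 code units
    assert(s.walkLength == 7);             // code points (dchar elements)
    assert(s.byGrapheme.walkLength == 6);  // graphemes, what a reader sees
}
```

Iterating by dchar yields the final "e" and its combining accent as two separate elements, which is why a single dchar cannot stand for a user-perceived character.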


Having a new type for grapheme would work too. My preference still goes
to reusing the string type because it makes the semantics simpler to
understand, especially when comparing graphemes with literals.


If a character literal actually became a grapheme instead of a dchar, then
that would likely solve that issue. But I fear that the semantics of having
a range be its own element type actually make understanding it _harder_,
not simpler. Being forced to compare string literals against what should
be a character would definitely confuse programmers.

Character literals are treated as simple numbers by the language. By
that I mean that you can write 'b' - 'a' == 1 and it'll be true.
Arithmetic makes absolutely no sense for graphemes. If you want a
special literal for graphemes, I'm afraid you'll have to invent
something new. And at this point, why not use a string?
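
The arithmetic point can be checked directly. A small sketch, relying only on the language's usual integer promotion rules:

```d
void main()
{
    // Character literals take part in integer arithmetic.
    static assert('b' - 'a' == 1);

    // The operands are promoted, so the result is an int, not a char.
    static assert(is(typeof('b' - 'a') == int));
}
```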

Making a new character or grapheme type which represented a grapheme
would be _far_ simpler to understand IMO. However, making it work
really well would likely require that the compiler know about the
grapheme type like it knows about dchar.

I'm looking for a simple solution. One that doesn't involve inventing a
new grapheme literal syntax or adding new types the compiler must know
about. I'm not really opposed to any of this, but the more complicated
the solution, the less likely it is to be adopted.

All I'm asking is that Unicode strings behave as Unicode strings should
behave. Making iteration use graphemes by default and string comparison
use the normalized form by default seems like a simple way to achieve
that goal.

The most important thing is not the implementation, but that the default
behaviour be the right behaviour.



--
Michel Fortin
[email protected]
http://michelf.com/
