On 2011-01-15 23:58:30 -0500, Jonathan M Davis <[email protected]> said:
On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
On 2011-01-15 20:49:00 -0500, Jonathan M Davis <[email protected]> said:
On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
I have my idea.
I think it'd be a good idea to improve upon Andrei's first idea --
which was to treat char[], wchar[], and dchar[] all as ranges of dchar
elements -- by changing the element type to be the same as the string.
For instance, iterating on a char[] would give you slices of char[],
each having one grapheme.
The second component would be to make the equality operator (==) for
strings compare them in their normalized form, so that ("e" with
combining acute accent) == (pre-combined "é"). I think this would make
D support for Unicode much more intuitive.
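To make the comparison idea concrete, here is a minimal sketch of what a
normalizing equality would do. It assumes a Phobos recent enough to
provide std.uni.normalize; the helper name sameText is hypothetical, and
the built-in == is left unchanged here.

```d
import std.uni : normalize, NFC;

// Hypothetical helper: compare two strings after normalizing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9";   // "é" as one code point
    string decomposed  = "e\u0301";  // "e" followed by a combining acute accent

    // The built-in == compares code units, so the two spellings differ.
    assert(precomposed != decomposed);

    // After normalization they compare equal.
    assert(sameText(precomposed, decomposed));
}
```

Under the proposal, the second assertion is what a plain == on strings
would express.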
This implies some semantic changes, mainly that everywhere you write a
"character" you must use double-quotes (string "a") instead of single
quote (code point 'a'), but from the user's point of view that's pretty
much all there is to change.
There'll still be plenty of room for specialized algorithms, but their
purpose would be limited to optimization. Correctness would be taken
care of by the basic range interface, and foreach should follow suit
and iterate by grapheme by default.
I wrote this example (or something similar) earlier in this thread:
    foreach (grapheme; "exposé")
        if (grapheme == "é")
            break;
In this example, even if one of these two strings uses the pre-combined
form of "é" and the other uses a combining acute accent, the equality
would still hold, since foreach iterates on full graphemes and ==
compares using normalization.
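The behaviour described for that loop can be approximated with library
code. This is only a sketch under assumptions: it relies on
std.uni.byGrapheme and std.uni.normalize from a recent Phobos, and
containsGrapheme is a hypothetical helper, not something foreach or ==
does by itself.

```d
import std.array : array;
import std.conv : to;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: true if `text` contains `target` as one whole
// grapheme, comparing in normalized (NFC) form.
bool containsGrapheme(string text, string target)
{
    auto wanted = normalize!NFC(target);
    foreach (g; text.byGrapheme)
    {
        // g[] yields the code points of this grapheme; re-encode as UTF-8.
        string cluster = g[].array.to!string;
        if (normalize!NFC(cluster) == wanted)
            return true;
    }
    return false;
}

void main()
{
    // "exposé" written with a combining accent, searched with precomposed "é".
    assert(containsGrapheme("expose\u0301", "\u00E9"));

    // Plain "expose" has no "é" grapheme at all.
    assert(!containsGrapheme("expose", "\u00E9"));
}
```

Each element handed to the loop body is a whole grapheme, so the accent
is never separated from its base letter.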
I think that that would cause definite problems. Having the element
type of the range be the same type as the range seems like it could
cause a lot of problems in std.algorithm and the like, and it's
_definitely_ going to confuse programmers. I'd expect it to be highly
bug-prone. They _need_ to be separate types.
I remember that someone already complained about this issue because he
had a tree of ranges, and Andrei said he would take a look at this
problem eventually. Perhaps now would be a good time.
Now, given that dchar can't actually work completely as an element
type, you'd either need the string type to be a new type or the element
type to be a new type. So, either the string type has char[], wchar[],
or dchar[] for its element type, or char[], wchar[], and dchar[] have
something like uchar as their element type, where uchar is a struct
which contains a char[], wchar[], or dchar[] which holds a single
grapheme.
Having a new type for graphemes would work too. My preference still goes
to reusing the string type because it makes the semantics simpler to
understand, especially when comparing graphemes with literals.
If a character literal actually became a grapheme instead of a dchar, then
that would likely solve that issue. But I fear that the semantics of
having a range be its own element type actually make understanding it
_harder_, not simpler. Being forced to compare string literals against
what should be a character would definitely confuse programmers.
Character literals are treated as simple numbers by the language. By
that I mean that you can write 'b' - 'a' == 1 and it'll be true.
Arithmetic makes absolutely no sense for graphemes. If you want a
special literal for graphemes, I'm afraid you'll have to invent
something new. And at this point, why not use a string?
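A small sketch of the distinction, for illustration only: character
literals take part in integer arithmetic, while a grapheme held in a
string has no such operation. The helper nextCodePoint is hypothetical.

```d
// Hypothetical helper: the code point that follows `c` numerically.
dchar nextCodePoint(dchar c)
{
    return cast(dchar)(c + 1);
}

void main()
{
    // Character literals behave as numbers.
    static assert('b' - 'a' == 1);
    assert(nextCodePoint('a') == 'b');

    // A grapheme such as "e" + combining acute spans two code points,
    // so it only fits in a string, and strings have no "+ 1".
    string grapheme = "e\u0301";
    static assert(!__traits(compiles, grapheme + 1));
}
```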
Making a new character or grapheme type which represented a grapheme
would be _far_ simpler to understand IMO. However, making it work
really well would likely require that the compiler know about the
grapheme type like it knows about dchar.
I'm looking for a simple solution. One that doesn't involve inventing a
new grapheme literal syntax or adding new types the compiler must know
about. I'm not really opposed to any of this, but the more complicated
the solution is, the less likely it is to be adopted.
All I'm asking is that Unicode strings behave as Unicode strings should
behave. Making iteration use graphemes by default and string comparison
use the normalized form by default seems like a simple way to achieve
that goal.
The most important thing is not the implementation, but that the default
behaviour be the right behaviour.
--
Michel Fortin
[email protected]
http://michelf.com/