On 2011-01-15 18:59:27 -0500, Andrei Alexandrescu
<[email protected]> said:
I'm unclear on where this is converging to. At this point the
commitment of the language and its standard library to (a) UTF array
representation and (b) code points conceptualization is quite strong.
Changing that would be quite difficult and disruptive, and the benefits
are virtually nonexistent for most of D's user base.
There's still a disagreement about whether a string or a code unit
array should be the default string representation, and whether
iterating on a code unit array should give you code unit or grapheme
elements. Of those who participated in the discussion, I don't
think anyone is disputing the idea that a grapheme element is better
than a dchar element for iterating over a string.
It may be more realistic to consider using what we have as back-end for
grapheme-oriented processing.
For example:
struct Grapheme(Char) if (isSomeChar!Char)
{
    private const Char[] rep;
    ...
}
auto byGrapheme(S)(S s) if (isSomeString!S)
{
    ...
}
string s = "Hello";
foreach (g; byGrapheme(s))
{
    ...
}
No doubt it's easier to implement it that way. The problem is that in
most cases it won't be used. How many people really know what a
grapheme is? Of those, how many will forget to use byGrapheme at one time
or another? And so in most programs string manipulation will misbehave
in the presence of combining characters or unnormalized strings.
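To make the failure mode concrete, here is a minimal sketch. It assumes a
Phobos release whose std.uni provides byGrapheme and normalize, which as far
as I know is later than the library under discussion in this thread. The
helper names codePointCount and graphemeCount are made up for illustration.
It shows the same user-visible character giving different answers depending
on whether you count code points or graphemes, and two canonically equivalent
strings comparing unequal until they are normalized:

```d
import std.range : walkLength;
import std.uni;

// Hypothetical helper: counts code points (iterating a string yields dchar).
size_t codePointCount(string s)
{
    return s.walkLength;
}

// Hypothetical helper: counts graphemes via std.uni.byGrapheme.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // The same "e with acute accent", written two ways.
    string precomposed = "\u00E9";  // single code point U+00E9
    string decomposed = "e\u0301";  // 'e' followed by combining acute U+0301

    // Code point iteration sees two different lengths...
    assert(codePointCount(precomposed) == 1);
    assert(codePointCount(decomposed) == 2);

    // ...while grapheme iteration sees one character in both cases.
    assert(graphemeCount(precomposed) == 1);
    assert(graphemeCount(decomposed) == 1);

    // A plain comparison treats the two spellings as different strings;
    // they only compare equal after normalizing to the same form.
    assert(precomposed != decomposed);
    assert(normalize!NFC(decomposed) == precomposed);
}
```

Anyone who forgets the byGrapheme call, or the normalization step, gets the
code point behaviour silently.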
If you want to help D programmers write correct code when it comes to
Unicode manipulation, you need to help them iterate on real characters
(graphemes), and you need the algorithms to apply to real characters
(graphemes), not the approximation of a Unicode character that is a
code point.
--
Michel Fortin
[email protected]
http://michelf.com/