On 11/21/10 7:11 PM, Michel Fortin wrote:
On 2010-11-20 18:58:33 -0500, Andrei Alexandrescu
<[email protected]> said:
D strings exhibit no such problems. They expose their implementation -
array of code units. Having that available is often handy. They also
obey a formal interface - bidirectional ranges.
It's convenient that char[] and wchar[] expose a dchar bidirectional
range interface... but only when a dchar bidirectional range is what you
want to use. If you want to iterate over code units (lower-level
representation), or graphemes (upper-level representation), then it gets
in your way.
I agree.
There is no easy notion of "character" in Unicode. A code point is *not*
a character. One character can span multiple code points. I fear
treating dchars as "the default character unit" is repeating the same kind
of mistake earlier frameworks made by adopting UCS-2 (now UTF-16) and
treating each 2-byte code unit as a character. I mean, what's the point
of working with the intermediary representation (code points) when it
doesn't represent a character?
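
To make the point above concrete, here is a minimal sketch, assuming a Phobos version where narrow strings present themselves as ranges of dchar. It uses only std.range and std.algorithm, and the sample string is an illustration that is not part of the original message. Reversing by code point detaches a combining accent from its base letter:

```d
import std.algorithm.comparison : equal;
import std.range : retro, walkLength;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
    // character built from two code points.
    string s = "e\u0301";

    // Walking the string as a range counts code points, not characters.
    assert(s.walkLength == 2);

    // Reversing by code point puts the accent in front of its base letter,
    // so the result is no longer the same user-perceived character.
    assert(equal(s.retro, "\u0301e"));
}
```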
I understand the concern, and that's why I strongly support formal
abstractions that are supported by, but largely independent from,
representations. If graphemes are to be modeled, D is in better shape
than other languages. What we need to do is define a range byGrapheme()
that accepts char[], wchar[], or dchar[].
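
As a rough sketch of what such a range looks like in use: later Phobos releases ship std.uni.byGrapheme, which did not exist when this thread was written. The helper name graphemeCount and the sample text below are hypothetical, chosen only for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: counts user-perceived characters for any string
// type whose range element type is dchar (char[], wchar[], dchar[]).
size_t graphemeCount(S)(S text)
{
    return text.byGrapheme.walkLength;
}

void main()
{
    // "noel" with U+0308 COMBINING DIAERESIS after the 'e'.
    string  s8  = "noe\u0308l";
    wstring s16 = "noe\u0308l"w;
    dstring s32 = "noe\u0308l"d;

    // The code unit counts depend on the encoding...
    assert(s8.length == 6);
    assert(s16.length == 5);
    assert(s32.length == 5);

    // ...but the grapheme count does not.
    assert(graphemeCount(s8) == 4);
    assert(graphemeCount(s16) == 4);
    assert(graphemeCount(s32) == 4);
}
```

The same template also accepts mutable arrays such as s8.dup, since only the element type matters.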
Instead, I think it'd be better that the level one wants to work at be
made explicit. If one wants to work with code points, he just rolls a
code-point bidirectional range on top of the string. If one wants to
work with graphemes (user-perceived characters), he just rolls a
grapheme bidirectional range on top of the string. In other words:
string str = "hello";
foreach (cu; str) {}            // code unit iteration
foreach (cp; str.codePoints) {} // code point iteration, bidirectional range of dchar
foreach (gr; str.graphemes) {}  // grapheme iteration, bidirectional range of graphemes
That'd be much cleaner than having some sort of hybrid
code-point/code-unit array/range.
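
The codePoints and graphemes properties above are a proposal, not an existing API. A close approximation can be sketched with adapters found in later Phobos releases (std.utf.byCodeUnit, std.utf.byDchar, and std.uni.byGrapheme). This assumes a reasonably recent compiler, and the sample string is illustrative only:

```d
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // "cafe" with U+0301 COMBINING ACUTE ACCENT after the 'e'.
    string str = "cafe\u0301";

    size_t units, points, clusters;
    foreach (cu; str.byCodeUnit) ++units;    // UTF-8 code units
    foreach (cp; str.byDchar)    ++points;   // decoded code points
    foreach (gr; str.byGrapheme) ++clusters; // user-perceived characters

    assert(units == 6);    // U+0301 takes two UTF-8 code units
    assert(points == 5);
    assert(clusters == 4); // 'e' + accent form a single grapheme
}
```

Each loop names its level explicitly, which is the property the proposal asks for.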
Here's a nice reference about Unicode graphemes, word segmentation, and
related algorithms.
<http://unicode.org/reports/tr29/>
I agree except for the fact that in my experience you want to iterate
over code points much more often than over code units. Iterating by code
unit by default is almost always wrong. That's why D's strings offer the
bidirectional interface by default. I have reasons to believe it was a
good decision.
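
For readers unfamiliar with that default, here is a small sketch of how the range primitives behave on a UTF-8 string, assuming a Phobos version where narrow strings decode to dchar. The sample text is illustrative. Note that indexing and .length still operate on code units:

```d
import std.range.primitives : ElementType, back, front, popFront;

void main()
{
    // Three code points, five UTF-8 code units (U+00E9 takes two).
    string s = "\u00e9t\u00e9";

    // The range interface of a string yields decoded code points.
    static assert(is(ElementType!string == dchar));
    assert(s.front == '\u00e9');
    assert(s.back == '\u00e9');

    // The array interface still exposes raw code units.
    assert(s.length == 5);
    assert(s[0] == 0xC3); // first byte of the UTF-8 encoding of U+00E9

    // popFront skips a whole code point, here two code units.
    s.popFront();
    assert(s == "t\u00e9");

    // foreach decodes when the loop variable is declared as dchar.
    size_t n;
    foreach (dchar c; "\u00e9t\u00e9") ++n;
    assert(n == 3);
}
```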
Andrei