Bearophile also linked this article:
http://forum.dlang.org/thread/nieoqqmidngwoqwnk...@forum.dlang.org
On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby wrote:
I've just been reading this article:
http://mortoray.com/2013/11/27/the-string-type-is-broken/ and
wanted to test whether D behaves in the way he describes, i.e.
Unicode strings being 'broken' because they are just arrays.
No. While Unicode strings are "just" arrays, that's not why they're
"broken". Unicode strings are *stored* in arrays, as a sequence of
"code units", but they are still decoded one whole "code point" at a
time, so that's not the issue.
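To illustrate the difference (a minimal sketch; é is U+00E9, a single
code point that takes two UTF-8 code units):

import std.range : walkLength;
import std.stdio;

void main()
{
    string s = "\u00E9";   // "é": one code point, two UTF-8 code units
    writeln(s.length);     // 2 -- code units: the storage
    writeln(s.walkLength); // 1 -- code points: what iteration yields
    foreach (dchar c; s)   // a dchar foreach decodes one full code point per step
        writeln(c);        // prints "é" once
}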
The main issue is that in Unicode, a "character" (if that means
anything), or rather a "grapheme", can be composed of two code points
that must not be separated. Currently, D does not know how to deal
with this.
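For example (a small sketch; U+0308 is COMBINING DIAERESIS):

import std.range : walkLength;
import std.stdio;

void main()
{
    string decomposed = "e\u0308"; // 'e' + combining diaeresis: two code points
    string precomposed = "\u00EB"; // 'ë' as one precomposed code point
    writeln(decomposed.walkLength);  // 2
    writeln(precomposed.walkLength); // 1
    // Both render as the single grapheme "ë", yet they compare unequal
    // and differ in length at both the code unit and code point level.
}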
Although I understand the difference between code units and code
points, it's not entirely clear what I need to do in D to avoid the
situations he describes. For example:
import std.algorithm;
import std.stdio;
void main(string[] args)
{
char[] x = "noël".dup;
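// N.B. this "noël" is in decomposed form: 'e' followed by U+0308
// (combining diaeresis), which is why it occupies 6 UTF-8 code units.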
assert(x.length == 6); // Actual
// assert(x.length == 4); // Expected.
This is a source of confusion: a string is *not* a random-access
range. This means that "length" is not actually part of the "string
interface"; it is only an underlying implementation detail.
Try this (hasLength and walkLength come from std.range):

import std.range;

alias String = string; // also try wstring or dstring
static if (hasLength!String)
    assert(x.length == 5);     // elements are whole code points (e.g. dstring)
else
    assert(x.walkLength == 5); // narrow string: decode and count code points
This will work regardless of the string's "width" (char/wchar/dchar).
Note, though, that it counts code points: 5 here, not the 4 graphemes
you expected (more on that below).
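A quick check of that claim, as a minimal sketch:

import std.range : walkLength;
import std.stdio;

void main()
{
    // walkLength decodes, so all three widths agree on the code point count:
    writeln("noe\u0308l".walkLength);  // 5
    writeln("noe\u0308l"w.walkLength); // 5
    writeln("noe\u0308l"d.walkLength); // 5
}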
assert(x[0 .. 3] == "noe".dup); // Actual.
// assert(x[0 .. 3] == "noë".dup); // Expected.
Again, don't slice your strings like that: a string is neither
random-access nor sliceable. You have no guarantee that your third
character starts at index 3. You want:
assert(equal(x.take(3), "noe")); // take is in std.range, equal in std.algorithm
Note that "x.take(3)" will not actually give you a slice, but a lazy
range. If you want a slice, you need to walk the string and extract
the index:
auto index = x.length - x.drop(3).length; // drop is also in std.range
assert(x[0 .. index] == "noe");
Note that this is *only* "UTF-correct"; it is still wrong from a
Unicode point of view. Again, that's because the ë here is actually a
single grapheme composed of *two* code points.
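Putting it together, here's a complete, compilable version of the
corrected checks (a sketch; the literal uses an explicit combining
mark so the numbers hold regardless of editor encoding):

import std.algorithm : equal;
import std.range : drop, take, walkLength;

void main()
{
    char[] x = "noe\u0308l".dup; // decomposed "noël"

    assert(x.length == 6);     // UTF-8 code units
    assert(x.walkLength == 5); // code points (the grapheme count would be 4)

    assert(equal(x.take(3), "noe")); // first three code points, lazily

    // For an actual slice, recover the code unit index first:
    auto index = x.length - x.drop(3).length;
    assert(x[0 .. index] == "noe");
}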
x.reverse;
assert(x == "l̈eon".dup); // Actual
// assert(x == "lëon".dup); // Expected.
}
Here I understand what is happening, but how could I improve this
example to make the expected asserts true?
AFAIK, we don't have any way of dealing with this (yet).