Bearophile also linked this article:
http://forum.dlang.org/thread/nieoqqmidngwoqwnk...@forum.dlang.org

On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby wrote:
I've just been reading this article: http://mortoray.com/2013/11/27/the-string-type-is-broken/ and wanted to test whether D behaves the same way as he describes, i.e. Unicode strings being 'broken' because they are just arrays.

No. While Unicode strings are "just" arrays, that's not why they're "broken". Unicode strings are *stored* in arrays, as a sequence of "code units", but they are still decoded an entire "code point" at a time, so that's not the issue.
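
You can see the two layers by iterating the same string both ways; foreach decodes on the fly when you ask for dchar:

import std.stdio;

void main()
{
    string s = "noe\u0308l"; // "noël" with a combining diaeresis (U+0308)

    foreach (char c; s)   // code units: the raw UTF-8 bytes, 6 of them
        writef("%02X ", cast(ubyte) c);
    writeln();

    foreach (dchar c; s)  // code points: decoded one at a time, 5 of them
        writef("%04X ", cast(uint) c);
    writeln();
}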

The main issue is that in Unicode, a "character" (if that means anything), or a "grapheme", can be composed of two code points that mustn't be separated. Currently, D does not know how to deal with this.
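
Concretely, ë has two valid Unicode spellings, and since == on strings compares the underlying arrays, they don't compare equal:

import std.range : walkLength;

void main()
{
    string precomposed = "\u00EB";  // 'ë' as a single code point (U+00EB)
    string decomposed  = "e\u0308"; // 'e' + combining diaeresis (U+0308)

    assert(precomposed != decomposed);    // == compares code units
    assert(precomposed.walkLength == 1);  // one code point
    assert(decomposed.walkLength == 2);   // two code points, one grapheme
}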

Although I understand the difference between code units and code points, it's not entirely clear what I need to do in D to avoid the situations he describes. For example:

import std.algorithm;
import std.stdio;

void main(string[] args)
{
        char[] x = "noël".dup;

        assert(x.length == 6); // Actual
        // assert(x.length == 4); // Expected.

This is a source of confusion: a string is *not* a random-access range. This means that "length" is not actually part of the "string interface": it is only an underlying implementation detail, and it counts code units, not characters.

Try this (hasLength and walkLength live in std.range):

import std.range : hasLength, walkLength;

alias String = typeof(x);
static if (hasLength!String)
    assert(x.length == 5);      // dchar[]: length counts code points directly
else
    assert(x.walkLength == 5);  // narrow strings: decode and count code points

Note the count is 5, not the "expected" 4: the combining diaeresis is still its own code point (more on that below).

This will work regardless of string's "width" (char/wchar/dchar).
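
For illustration, here is that pattern applied to all three widths, a sketch using the decomposed spelling "noe\u0308l" from the example, which is 5 code points:

import std.range : hasLength, walkLength;

void checkLength(String)(String s)
{
    static if (hasLength!String)  // true only for dstring here
        assert(s.length == 5);
    else                          // char/wchar: length would be code units
        assert(s.walkLength == 5);
}

void main()
{
    checkLength("noe\u0308l");   // string:  6 code units, 5 code points
    checkLength("noe\u0308l"w);  // wstring: 5 code units, 5 code points
    checkLength("noe\u0308l"d);  // dstring: 5 code units == 5 code points
}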

        assert(x[0 .. 3] == "noe".dup); // Actual.
        // assert(x[0 .. 3] == "noë".dup); // Expected.

Again, don't slice your strings like that: a string is neither random-access nor sliceable in the range sense. You have no guarantee that your third character starts at index 3. You want:

import std.range : take; // equal is already imported from std.algorithm

assert(equal(x.take(3), "noe")); // compares the first 3 decoded code points

Note that "x.take(3)" will not actually give you a slice, but a lazy range. If you want a slice, you need to walk the string and extract the index:

import std.range : drop;

auto index = x.length - x.drop(3).length; // drop 3 code points; the remainder's length is in code units
assert(x[0 .. index] == "noe");
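
Alternatively, std.utf's toUTFindex maps a code point count to the corresponding code unit index, if I remember its interface right:

import std.utf : toUTFindex;

auto index = x.toUTFindex(3); // code unit index where the 4th code point starts
assert(x[0 .. index] == "noe");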

Note that this is *only* "UTF-correct": it is still wrong from a Unicode point of view. Again, that's because ë here is actually a single grapheme composed of *two* code points.
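
If your Phobos ships the overhauled std.uni (an assumption about your compiler version), graphemeStride lets you step by graphemes instead of code points:

import std.uni : graphemeStride;

// advance over 3 graphemes, accumulating a code unit index
size_t index = 0;
foreach (_; 0 .. 3)
    index += graphemeStride(x, index);

assert(x[0 .. index] == "noe\u0308"); // "noë": the diaeresis comes along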

        x.reverse;

        assert(x == "l̈eon".dup); // Actual
        // assert(x == "lëon".dup); // Expected.
}

Here I understand what is happening, but how could I improve this example to make the expected asserts true?

AFAIK, we don't have any way of dealing with this (yet).
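
Once the new std.uni primitives (byGrapheme/byCodePoint) are available in your Phobos, a grapheme-correct reversal could look like this sketch:

import std.array : array;
import std.conv : to;
import std.range : retro;
import std.uni : byGrapheme, byCodePoint;

void main()
{
    string s = "noe\u0308l"; // "noël", decomposed

    // reverse graphemes, not code points, so the diaeresis stays on its 'e'
    string reversed = s.byGrapheme.array.retro.byCodePoint.array.to!string;

    assert(reversed == "le\u0308on"); // "lëon"
}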
