Bearophile also linked this article:
http://forum.dlang.org/thread/nieoqqmidngwoqwnk...@forum.dlang.org
On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby wrote:
I've just been reading this article:
http://mortoray.com/2013/11/27/the-string-type-is-broken/ and
wanted to test whether D behaves in the way he describes, i.e.
Unicode strings being 'broken' because they are just arrays.
No. While Unicode strings are "just" arrays, that's not why they're
"broken". Unicode strings are *stored* in arrays, as a sequence of
"code units", but they are still decoded one whole "code point" at a
time, so that's not the issue.
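To illustrate the difference (a minimal sketch; é is U+00E9, a single
code point that takes two UTF-8 code units):

import std.range : walkLength;
import std.stdio;

void main()
{
    string s = "\u00E9";   // "é": one code point, two UTF-8 code units
    writeln(s.length);     // 2 -- code units: the storage
    writeln(s.walkLength); // 1 -- code points: what iteration yields
    foreach (dchar c; s)   // a dchar foreach decodes one full code point per step
        writeln(c);        // prints "é" once
}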
The main issue is that in Unicode, a "character" (if that means
anything), or rather a "grapheme", can be composed of two code points
that must not be separated. Currently, D does not know how to deal
with this.
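For example (a small sketch; U+0308 is COMBINING DIAERESIS):

import std.range : walkLength;
import std.stdio;

void main()
{
    string decomposed = "e\u0308"; // 'e' + combining diaeresis: two code points
    string precomposed = "\u00EB"; // 'ë' as one precomposed code point
    writeln(decomposed.walkLength);  // 2
    writeln(precomposed.walkLength); // 1
    // Both render as the single grapheme "ë", yet they compare unequal
    // and differ in length at both the code unit and code point level.
}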
Although I understand the difference between code units and code
points, it's not entirely clear what I need to do in D to avoid the
situations he describes. For example:
import std.algorithm;
import std.stdio;
void main(string[] args)
{
char[] x = "noël".dup;
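// N.B. this "noël" is in decomposed form: 'e' followed by U+0308
// (combining diaeresis), which is why it occupies 6 UTF-8 code units.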
assert(x.length == 6); // Actual
// assert(x.length == 4); // Expected.
This is a source of confusion: a string is *not* a random-access
range. This means that "length" is not actually part of the "string
interface"; it is only an underlying implementation detail.
Try this (hasLength and walkLength come from std.range):

import std.range;

alias String = string; // also try wstring or dstring
static if (hasLength!String)
    assert(x.length == 5);     // elements are whole code points (e.g. dstring)
else
    assert(x.walkLength == 5); // narrow string: decode and count code points
This will work regardless of the string's "width" (char/wchar/dchar).
Note, though, that it counts code points: 5 here, not the 4 graphemes
you expected (more on that below).
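A quick check of that claim, as a minimal sketch:

import std.range : walkLength;
import std.stdio;

void main()
{
    // walkLength decodes, so all three widths agree on the code point count:
    writeln("noe\u0308l".walkLength);  // 5
    writeln("noe\u0308l"w.walkLength); // 5
    writeln("noe\u0308l"d.walkLength); // 5
}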
assert(x[0 .. 3] == "noe".dup); // Actual.
// assert(x[0 .. 3] == "noë".dup); // Expected.
Again, don't slice your strings like that: a string is neither
random-access nor sliceable. You have no guarantee that your third
character starts at index 3. You want:
assert(equal(x.take(3), "noe")); // take is in std.range, equal in std.algorithm
Note that "x.take(3)" will not actually give you a slice, but a lazy
range. If you want a slice, you need to walk the string and extract
the index:
auto index = x.length - x.drop(3).length; // drop is also in std.range
assert(x[0 .. index] == "noe");
Note that this is *only* "UTF-correct"; it is still wrong from a
Unicode point of view. Again, that's because the ë here is actually a
single grapheme composed of *two* code points.
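Putting it together, here's a complete, compilable version of the
corrected checks (a sketch; the literal uses an explicit combining
mark so the numbers hold regardless of editor encoding):

import std.algorithm : equal;
import std.range : drop, take, walkLength;

void main()
{
    char[] x = "noe\u0308l".dup; // decomposed "noël"

    assert(x.length == 6);     // UTF-8 code units
    assert(x.walkLength == 5); // code points (the grapheme count would be 4)

    assert(equal(x.take(3), "noe")); // first three code points, lazily

    // For an actual slice, recover the code unit index first:
    auto index = x.length - x.drop(3).length;
    assert(x[0 .. index] == "noe");
}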
x.reverse;
assert(x == "l̈eon".dup); // Actual
// assert(x == "lëon".dup); // Expected.
}
Here I understand what is happening, but how could I improve this
example to make the expected asserts true?
AFAIK, we don't have any way of dealing with this (yet).