On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
This is simply wrong. All strings return number of codeunits. And it's only UTF-32 where codepoint (~ character) happens to fit into one codeunit.

I do not agree:

   writeln("säд".length);         => 5  bytes: 5 (1 + 2 [C3 A4] + 2 [D0 94], UTF-8)
   writeln(std.utf.count("säд")); => 3  bytes: 5 (ibidem)
   writeln("säд"w.length);        => 3  bytes: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);        => 3  bytes: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.
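For anyone who wants to reproduce this, here is the same comparison as a minimal complete program (assuming a standard DMD/Phobos setup):

```d
import std.stdio;
import std.utf : count;

void main()
{
    // .length counts code units of the string's encoding;
    // std.utf.count counts Unicode code points.
    writeln("säд".length);   // 5 - UTF-8 code units (1 + 2 + 2 bytes)
    writeln(count("säд"));   // 3 - code points
    writeln("säд"w.length);  // 3 - UTF-16 code units
    writeln("säд"d.length);  // 3 - UTF-32 code units
}
```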

And this is not the only inconsistency here.

First of all, typeof("säд") yields the string type (immutable(char)[]), while typeof(['s', 'ä', 'д']) yields neither char[] nor wchar[] nor even dchar[], but int[]. In this respect D is close to C, which also treats character literals as an integer type. Secondly, character arrays are the only arrays that have two kinds of literals - the usual [item, item, item] and the special "blah" - and, as you can see, there is no correspondence between the two.
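This is easy to check at compile time. Note that the type inferred for the bracketed literal may differ between compiler versions, so treat the second comment as what this post reports, not as a guarantee:

```d
// Compile-time type check (e.g. with `dmd -c`):
pragma(msg, typeof("säд"));           // string, i.e. immutable(char)[]
pragma(msg, typeof(['s', 'ä', 'д'])); // reported as int[] at the time of this post

void main() {}
```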

If you try char[] x = cast(char[])['s', 'ä', 'д'], then the length would indeed be 3 (but don't use this - it is broken).

In D, a dynamic array is represented at the binary level as struct { void* ptr; size_t length; }. When you perform operations on dynamic arrays, the compiler implements them as calls to runtime functions. However, at runtime it is impossible to do anything useful with an array knowing only its starting address and total number of elements (this is a source of other problems in D). To handle this, the compiler generates a "TypeInfo" and passes it as a separate argument to the runtime functions. TypeInfo contains some data; the most relevant piece here is the size of an element.
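The element-size information can be observed directly; this sketch assumes nothing beyond druntime's public TypeInfo interface:

```d
import std.stdio;

void main()
{
    string s = "säд";
    // The array itself is just (ptr, length) - and length is 5 here,
    // because the elements are 1-byte chars:
    writeln(s.length);            // 5
    // The element size travels separately, via TypeInfo:
    writeln(typeid(char).tsize);  // 1
    writeln(typeid(wchar).tsize); // 2
    writeln(typeid(dchar).tsize); // 4
}
```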

What happens is as follows. The compiler recognizes that "säд" should be a string literal encoded as UTF-8 (http://dlang.org/lex.html#DoubleQuotedString), so the element type is char, which requires 5 elements in the array. So at runtime the object "säд" is treated as an array of 5 elements, each 1 byte in size.
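You can see those 5 one-byte elements by reinterpreting the literal as raw bytes (the cast only changes the view of the data, not the data itself):

```d
import std.stdio;

void main()
{
    immutable(ubyte)[] raw = cast(immutable(ubyte)[]) "säд";
    // 's' is one byte; 'ä' and 'д' are two bytes each in UTF-8.
    writefln("%(%02X %)", raw);
    assert(raw.length == 5);
}
```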

Basically, string (and char[]) plays a dual role in the language: on the one hand, it is by definition an array of elements that are strictly 1 byte in size; on the other hand, D tries to use it as a 'generic' UTF type whose element size is not fixed. So there is a contradiction: in source code the programmer views such a string as some abstract UTF string, but druntime views it as a 5-byte array. In my view, the trouble begins when "säд" is internally cast down to char elements (which is no better than int[] x = [3.14, 5.6]). And indeed, char[] x = ['s', 'ä', 'д'] is rejected by the language, so there is a great inconsistency here.
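The dual role shows up directly in iteration: foreach over a string walks 1-byte code units by default, but decodes on the fly if you ask for dchar elements. A small sketch:

```d
import std.stdio;

void main()
{
    string s = "säд";

    size_t units, points;
    foreach (char c; s)  ++units;   // raw 1-byte elements
    foreach (dchar c; s) ++points;  // decoded code points

    writeln(units);  // 5
    writeln(points); // 3
}
```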

By the way, the UTF definition is irrelevant here; this is purely an implementation issue (I think it is a design fault).
