On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
This is simply wrong. All strings return number of codeunits. And it's only UTF-32 where codepoint (~ character) happens to fit into one codeunit.

I do not agree:

   writeln("säд".length);         => 5  bytes: 5 (1 + 2 [C3 A4] + 2 [D0 94], UTF-8)
   writeln(std.utf.count("säд")); => 3  bytes: 5 (ibidem)
   writeln("säд"w.length);        => 3  bytes: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);        => 3  bytes: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.
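For anyone who wants to reproduce this, here is the same comparison as a minimal complete program (assuming a standard DMD/Phobos setup):

```d
import std.stdio;
import std.utf : count;

void main()
{
    // .length counts code units of the string's encoding;
    // std.utf.count counts Unicode code points.
    writeln("säд".length);   // 5 - UTF-8 code units (1 + 2 + 2 bytes)
    writeln(count("säд"));   // 3 - code points
    writeln("säд"w.length);  // 3 - UTF-16 code units
    writeln("säд"d.length);  // 3 - UTF-32 code units
}
```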

And this is not the only inconsistency here.

First of all, typeof("säд") yields the string type (immutable(char)[]), while typeof(['s', 'ä', 'д']) yields neither char[] nor wchar[] nor even dchar[], but int[]. In this respect D is close to C, which also treats character literals as an integer type. Secondly, character arrays are the only arrays that have two kinds of literals - the usual [item, item, item] and the special "blah" - and, as you can see, there is no correspondence between the two.
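This is easy to check at compile time. Note that the type inferred for the bracketed literal may differ between compiler versions, so treat the second comment as what this post reports, not as a guarantee:

```d
// Compile-time type check (e.g. with `dmd -c`):
pragma(msg, typeof("säд"));           // string, i.e. immutable(char)[]
pragma(msg, typeof(['s', 'ä', 'д'])); // reported as int[] at the time of this post

void main() {}
```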

If you try char[] x = cast(char[])['s', 'ä', 'д'], then the length would indeed be 3 (but don't use this - it is broken).

In D, a dynamic array is represented at the binary level as struct { void* ptr; size_t length; }. When you perform operations on dynamic arrays, the compiler implements them as calls to runtime functions. However, at runtime it is impossible to do anything useful with an array knowing only its starting address and total number of elements (this is a source of other problems in D). To handle this, the compiler generates a "TypeInfo" and passes it as a separate argument to the runtime functions. TypeInfo contains some data; the most relevant piece here is the size of an element.
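The element-size information can be observed directly; this sketch assumes nothing beyond druntime's public TypeInfo interface:

```d
import std.stdio;

void main()
{
    string s = "säд";
    // The array itself is just (ptr, length) - and length is 5 here,
    // because the elements are 1-byte chars:
    writeln(s.length);            // 5
    // The element size travels separately, via TypeInfo:
    writeln(typeid(char).tsize);  // 1
    writeln(typeid(wchar).tsize); // 2
    writeln(typeid(dchar).tsize); // 4
}
```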

What happens is as follows. The compiler recognizes that "säд" should be a string literal encoded as UTF-8 (http://dlang.org/lex.html#DoubleQuotedString), so the element type is char, which requires 5 elements in the array. So at runtime the object "säд" is treated as an array of 5 elements, each 1 byte in size.
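You can see those 5 one-byte elements by reinterpreting the literal as raw bytes (the cast only changes the view of the data, not the data itself):

```d
import std.stdio;

void main()
{
    immutable(ubyte)[] raw = cast(immutable(ubyte)[]) "säд";
    // 's' is one byte; 'ä' and 'д' are two bytes each in UTF-8.
    writefln("%(%02X %)", raw);
    assert(raw.length == 5);
}
```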

Basically, string (and char[]) plays a dual role in the language: on the one hand, it is by definition an array of elements that are strictly 1 byte in size; on the other hand, D tries to use it as a 'generic' UTF type whose element size is not fixed. So there is a contradiction: in source code the programmer views such a string as some abstract UTF string, but druntime views it as a 5-byte array. In my view, the trouble begins when "säд" is internally cast down to char elements (which is no better than int[] x = [3.14, 5.6]). And indeed, char[] x = ['s', 'ä', 'д'] is rejected by the language, so there is a great inconsistency here.
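The dual role shows up directly in iteration: foreach over a string walks 1-byte code units by default, but decodes on the fly if you ask for dchar elements. A small sketch:

```d
import std.stdio;

void main()
{
    string s = "säд";

    size_t units, points;
    foreach (char c; s)  ++units;   // raw 1-byte elements
    foreach (dchar c; s) ++points;  // decoded code points

    writeln(units);  // 5
    writeln(points); // 3
}
```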

By the way, the UTF definition is irrelevant here; this is purely an implementation issue (I think it is a design fault).
