Re: Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

Ali Çehreli via Digitalmars-d-learn Fri, 02 Dec 2022 14:56:35 -0800

On 12/2/22 13:18, thebluepandabear wrote:

> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'?


The integral value of Ğ in Unicode is 286 (code point U+011E).

  https://unicodeplus.com/U+011E

Since 'char' is 8 bits, it can hold values only up to 255, so it cannot store 286.

At first, that sounds like a hopeless situation, making one think that Ğ cannot be represented in a string. The concept of encoding to the rescue: Ğ can be encoded by 2 chars:


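import std.stdio;

void main() {
    // Iterating a string without a loop variable type visits
    // each UTF-8 code unit (char), not each character.
    foreach (c; "Ğ") {
        writefln!"%b"(c);  // print the code unit in binary
    }
}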

That program prints

11000100
10011110

Articles like the following explain well how that second byte is a continuation byte:


  https://en.wikipedia.org/wiki/UTF-8#Encoding

(It's a continuation byte because it starts with the bits 10).
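
To make the bit layout concrete, here is a minimal sketch that puts the two bytes printed above back together by hand. The helper name decodeTwoBytes is hypothetical (it is not from the book or from Phobos), and it only handles the 2-byte case: the lead byte has the form 110xxxxx and the continuation byte has the form 10xxxxxx, and the x bits together carry the value.

```d
import std.stdio;

// Hypothetical helper for illustration only: combines a 2-byte UTF-8
// sequence into the integral value it encodes.
uint decodeTwoBytes(char lead, char continuation) {
    // A 2-byte lead unit has the form 110xxxxx.
    assert((lead & 0b1110_0000) == 0b1100_0000);
    // A continuation unit has the form 10xxxxxx.
    assert((continuation & 0b1100_0000) == 0b1000_0000);

    // Keep the 5 payload bits of the lead and the 6 of the continuation.
    uint high = lead & 0b0001_1111;
    uint low = continuation & 0b0011_1111;
    return (high << 6) | low;
}

void main() {
    string s = "Ğ";
    writeln(decodeTwoBytes(s[0], s[1]));  // prints 286
}
```

For 11000100 and 10011110, the payload bits are 00100 and 011110; side by side they read 00100011110, which is 286.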

> I don't think it was explained well in
> the book.

Coincidentally, according to other feedback I received recently, Unicode and UTF are introduced way too early for such a book. I agree. I hadn't understood a single thing the first time smart people tried to explain Unicode and UTF encodings at the company where I worked. I had years of programming experience back then. (Although, I now think the instructors were not really good; and the company was pretty bad as well. :) )


> Any help would be appreciated.

I recommend the Wikipedia page I linked above. It is enlightening to understand how about 150K Unicode characters can be encoded with units of 8 bits.
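
As a small illustration of that idea, the following sketch prints how many 8-bit units a few characters take in UTF-8. The sample characters are my own picks; the only thing it relies on is that .length of a string counts code units (chars), not characters. The source file is assumed to be saved as UTF-8.

```d
import std.stdio;

void main() {
    // Each literal holds a single character, yet .length differs
    // because it counts UTF-8 code units.
    foreach (s; ["a", "Ğ", "€", "😀"]) {
        writefln!"%s takes %s code unit(s)"(s, s.length);
    }
}
```

The lengths come out as 1, 2, 3, and 4: the larger the integral value of the character, the more chars its encoding needs.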

You can safely ignore wchar, dchar, wstring, and dstring for daily coding. Only special programs may need to deal with those types. 'char' and string are what we need and do use predominantly in D.

Ali
