On Fri, Dec 02, 2022 at 09:18:44PM +0000, thebluepandabear via Digitalmars-d-learn wrote:
> Hello (noob question),
>
> I am reading a book about D by Ali, and he talks about the different
> char types: char, wchar, and dchar. He says that char stores a UTF-8
> code unit, wchar stores a UTF-16 code unit, and dchar stores a UTF-32
> code unit, this makes sense.
>
> He then goes on to say that:
>
> "Contrary to some other programming languages, characters in D may
> consist of different numbers of bytes. For example, because 'Ğ' must
> be represented by at least 2 bytes in Unicode, it doesn't fit in a
> variable of type char. On the other hand, because dchar consists of 4
> bytes, it can hold any Unicode character."
>
> It's his explanation as to why this code doesn't compile even though Ğ
> is a UTF-8 code unit:
>
> ```D
> char utf8 = 'Ğ';
> ```
>
> But I don't really understand this? What does it mean that it 'must be
> represented by at least 2 bytes'? If I do `char.sizeof` it's 2 bytes
> so I am confused why it doesn't fit, I don't think it was explained
> well in the book.
That's wrong, char.sizeof should be exactly 1 byte, no more, no less.

First, before we talk about Unicode, we need to get the terminology straight:

Code unit = unit of storage in a particular representation (encoding) of Unicode. E.g., a UTF-8 string consists of a stream of 1-byte code units, a UTF-16 string consists of a stream of 2-byte code units, etc. Do NOT confuse this with "code point", or worse, "character".

Code point = the abstract Unicode entity that occupies a single slot in the Unicode tables. Usually written as U+xxxx where xxxx is some hexadecimal number.

IMPORTANT NOTE: do NOT confuse a code point with what a normal human being thinks of as a "character". Even though in many cases a code point happens to represent a single "character", this isn't always true. It's safer to understand a code point as a single slot in one of the Unicode tables.

NOTE: a code point may be represented by multiple code units, depending on the encoding. For example, in UTF-8, some code points require multiple code units (multiple bytes) to represent. This varies depending on the character: the code point `A` needs only a single code unit, the code point `Ш` needs 2 bytes, and the code point `😀` requires 4 bytes. In UTF-16, `A` and `Ш` each occupy only 1 code unit (2 bytes, because in UTF-16 one code unit == 2 bytes), but `😀` needs 2 code units (4 bytes).

Note that neither code unit nor code point corresponds directly to what we normally think of as a "character". The Unicode terminology for that is:

Grapheme = one or more code points that combine together to produce a single visual representation. For example, the 2-code-point sequence U+006D U+030A produces the *single* grapheme `m̊`, and the 3-code-point sequence U+03C0 U+0306 U+032F produces the grapheme `π̯̆`. Note that each code point in these sequences may require multiple code units, depending on which encoding you're using. This email is encoded in UTF-8, so the first sequence occupies 3 bytes (1 byte for the 1st code point, 2 bytes for the second), and the second sequence occupies 6 bytes (2 bytes per code point).

//

OK, now let's talk about D. In D, we have 3 "character" types (I'm putting "character" in quotes because they are actually code units; do NOT confuse them with visual characters): char, wchar, and dchar, which are 1, 2, and 4 bytes, respectively.

To find out whether something fits into a char, first you have to find out how many code points it occupies, and second, how many code units are required to represent those code points.

For example, the character `À` can be represented by the single code point U+00C0. However, it requires *two* UTF-8 code units to represent (this is a consequence of how UTF-8 represents code points), in spite of being a value that's less than 256. So U+00C0 would not fit into a single char; you need (at least) 2 chars to hold it.

If we were to use UTF-16 instead, U+00C0 would easily fit into a single code unit. Each code unit in UTF-16, however, is 2 bytes, so for some code points (such as 'a', U+0061), the UTF-8 encoding would be smaller.

A dchar always fits any Unicode code point, because code points can only go up to 0x10FFFF (a value that fits in 3 bytes). HOWEVER, using dchar does NOT guarantee that it will hold a complete visual character, because Unicode graphemes can be arbitrarily long. For example, the `π̯̆` grapheme above requires at least 3 code points to represent, which means it requires at least 3 dchars (== 12 bytes) to represent. In UTF-8 encoding, however, it occupies only 6 bytes (still the same 3 code points, just encoded differently).
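If you want to see all of this in actual code, here's a rough sketch you could compile and run (I haven't pushed it through a compiler for this email; the variable names are mine, and the counts in the comments are what I'd expect given the encodings described above):

```D
import std.stdio;

void main()
{
    // The three "character" types are fixed-size code units:
    writeln(char.sizeof);   // 1
    writeln(wchar.sizeof);  // 2
    writeln(dchar.sizeof);  // 4

    // The same text stored in the three encodings; .length counts
    // code units, NOT visual characters:
    string  s8  = "AШ😀";   // UTF-8:  immutable(char)[]
    wstring s16 = "AШ😀"w;  // UTF-16: immutable(wchar)[]
    dstring s32 = "AШ😀"d;  // UTF-32: immutable(dchar)[]

    writeln(s8.length);  // 7 code units: 1 (A) + 2 (Ш) + 4 (😀)
    writeln(s16.length); // 4 code units: 1 (A) + 1 (Ш) + 2 (😀)
    writeln(s32.length); // 3 code units: one per code point

    // 'À' (U+00C0) needs 2 UTF-8 code units, so it doesn't fit in a char,
    // but it fits in a single wchar or dchar:
    //char c = 'À';   // error: cannot fit in a char
    wchar w = 'À';    // OK: 1 UTF-16 code unit
    dchar d = 'À';    // OK: any code point fits in a dchar
}
```

The point being: .length (and char.sizeof) tells you about code units, not about "characters" in the everyday sense.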
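And if you want to count graphemes rather than code units or code points, Phobos has std.uni.byGrapheme. Again, a rough sketch (the \u030A escape is the combining ring above from the `m̊` example):

```D
import std.stdio;
import std.uni : byGrapheme;
import std.range : walkLength;

void main()
{
    // U+006D (m) followed by U+030A (combining ring above):
    // one grapheme, two code points, three UTF-8 code units.
    string s = "m\u030A";

    writeln(s.length);                // 3: UTF-8 code units (bytes)
    writeln(s.walkLength);            // 2: code points (decoded as dchars)
    writeln(s.byGrapheme.walkLength); // 1: graphemes
}
```

Three different answers for the "length" of the same string, depending on which level you're asking about.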
//

I hope this is clear (as mud :P -- Unicode is a complex beast).  Or at least clear*er*, anyway.


T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG