On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole wrote:
> > Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> > The size of a code point is 1, 2 or 4 bytes.
>
> I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4
> (UTF-32) bytes are referred to as "code units" and the size of a
> code point varies in UTF-8 and UTF-16.

Yes, for UTF-8, a code unit is 8 bits, and there can be up to 4 of them in a code point (the original UTF-8 design allowed up to 6, but RFC 3629 restricted it to 4). For UTF-16, a code unit is 16 bits, and there are either 1 or 2 code units per code point. For UTF-32, a code unit is 32 bits, and there is always 1 code unit per code point.

For better or worse (mostly worse), ranges then treat all strings as ranges of code points and decode them to code points such that you get a range of dchar (which means fun things like isRandomAccessRange!string and hasLength!string are false).

As I understand it, a code point is the smallest thing that Unicode assigns a number to, but it's not necessarily a full character (or even something that can be printed on its own). Multiple code points can then be combined to make a grapheme cluster (which then corresponds to what we'd normally consider a full character - e.g. a letter and an accent can each be a code point which are then combined to create an accented character). std.uni provides the functionality for operating on graphemes. And std.utf.byCodeUnit can be used to treat strings as ranges of code units instead of code points (and a fair bit of Phobos takes the solution of specializing range-based code for strings to avoid the auto-decoding).

All in all, the whole thing is annoyingly complicated, though at least D is much more explicit about it than most languages, and I suspect that your average D programmer is better educated about Unicode than your average programmer. And having to figure out why the heck strings and wstrings act so bizarrely as ranges does have the positive side effect of putting it even more in your face than it would be otherwise, making it that much more likely that folks are going to learn about Unicode - though I still think that we'd be better off if we could ever figure out how to treat all strings as ranges of code units without breaking everything in the process. :|

- Jonathan M Davis
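
To make the code unit / code point / grapheme distinction above concrete, here is a minimal sketch, assuming a reasonably recent Phobos with auto-decoding in its default state. The helper names countCodeUnits, countCodePoints and countGraphemes are hypothetical and exist only for this illustration; the library calls used are walkLength, std.utf.byCodeUnit and std.uni.byGrapheme.

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange, walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Auto-decoding: as ranges, narrow strings yield dchar (code points), so the
// range traits report them as neither random-access nor having a length.
static assert(is(ElementType!string == dchar));
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);

// byCodeUnit wraps the string so that it is a range of code units again.
static assert(isRandomAccessRange!(typeof("".byCodeUnit)));
static assert(hasLength!(typeof("".byCodeUnit)));

/// Hypothetical helper: number of code units (what .length on the array reports).
size_t countCodeUnits(S)(S s) { return s.length; }

/// Hypothetical helper: number of code points (walks the auto-decoded range).
size_t countCodePoints(S)(S s) { return s.walkLength; }

/// Hypothetical helper: number of grapheme clusters, via std.uni.byGrapheme.
size_t countGraphemes(S)(S s) { return s.byGrapheme.walkLength; }

void report(S)(string label, S s)
{
    writefln("%-8s code units: %s, code points: %s, graphemes: %s",
             label, countCodeUnits(s), countCodePoints(s), countGraphemes(s));
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // two code points that combine into one grapheme.
    report("string",  "e\u0301");   // 3 code units, 2 code points, 1 grapheme
    report("wstring", "e\u0301"w);  // 2 code units, 2 code points, 1 grapheme
    report("dstring", "e\u0301"d);  // 2 code units, 2 code points, 1 grapheme

    // U+1F600 lies outside the Basic Multilingual Plane, so it needs
    // 4 UTF-8 code units or a UTF-16 surrogate pair.
    report("string",  "\U0001F600");   // 4 code units, 1 code point
    report("wstring", "\U0001F600"w);  // 2 code units, 1 code point
    report("dstring", "\U0001F600"d);  // 1 code unit,  1 code point
}
```

Note that .length on the array always counts code units, which is why it disagrees with walkLength for string and wstring whenever a code point needs more than one code unit.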