On Thursday, November 30, 2017 18:32:46 A Guy With a Question via Digitalmars-d wrote:
> On Thursday, 30 November 2017 at 17:56:58 UTC, Jonathan M Davis wrote:
> > On Thursday, November 30, 2017 03:37:37 Walter Bright via Digitalmars-d wrote:
> >
> > Language-wise, I think that most of the UTF-16 use is driven by the fact that Java went with UCS-2 / UTF-16, and C# followed them (both because they were copying Java and because the Win32 API had gone with UCS-2 / UTF-16). So, that's had a lot of influence on folks, though most others have gone with UTF-8 for backwards compatibility and because it typically takes up less space for non-Asian text. But the use of UTF-16 in Windows, Java, and C# does seem to have resulted in some folks thinking that wide characters mean Unicode and narrow characters mean ASCII.
> >
> > - Jonathan M Davis
>
> I think it also simplifies the logic. You are not always looking to represent the code points symbolically. You are just trying to see what information is in it. Therefore, if you can practically treat a code point as the unit of data behind the scenes, it simplifies the logic.
Even if that were true, UTF-16 code units are not code points. If you want to operate on code points, you have to go to UTF-32. And even if you're at UTF-32, you have to worry about Unicode normalization; otherwise, the same information can be represented differently, even if all you care about is code points and not graphemes. And of course, some stuff really does care about graphemes, since those are the actual characters. Ultimately, you have to understand how code units, code points, and graphemes work and what you're doing with a particular algorithm so that you know at which level you should operate and where the pitfalls are. Some code can operate on code units and be fine; some can operate on code points; and some can operate on graphemes. But there is no one-size-fits-all solution that makes it all magically easy and efficient to use.

And UTF-16 does _nothing_ to improve any of this over UTF-8. It's just a different way to encode code points. And really, it makes things worse, because it usually takes up more space than UTF-8, and it makes it easier to miss when you screw up your Unicode handling, because more UTF-16 code units are valid code points than UTF-8 code units are, but they still aren't all valid code points. So, if you use UTF-8, you're more likely to catch your mistakes.

Honestly, I think that the only good reason to use UTF-16 is if you're interacting with existing APIs that use UTF-16, and even then, I think that in most cases, you're better off using UTF-8 and converting to UTF-16 only when you have to. Strings eat less memory that way, and mistakes are more easily caught. And if you're writing cross-platform code in D, then Windows is really the only place where you're typically going to have to deal with UTF-16, so it definitely works better in general to favor UTF-8 in D programs. But regardless, at least D gives you the tools to deal with the different Unicode encodings relatively cleanly and easily, so you can use whichever Unicode encoding you need to. Most D code is going to use UTF-8, though.

- Jonathan M Davis
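
To make the three levels concrete, here is a minimal D sketch of what the post describes (assuming current Phobos: std.utf.byCodeUnit, std.uni.byGrapheme, and std.uni.normalize; the combining-accent string is just an illustrative example). The same user-visible character counts differently as code units, code points, and graphemes, and compares unequal until it's normalized:

import std.range : walkLength;
import std.uni;               // byGrapheme, normalize, NFC
import std.utf : byCodeUnit;

void main()
{
    // "é" written as 'e' followed by a combining acute accent (U+0301).
    string  s8  = "e\u0301";
    wstring s16 = "e\u0301"w;

    assert(s8.byCodeUnit.walkLength == 3);  // UTF-8 code units
    assert(s16.byCodeUnit.walkLength == 2); // UTF-16 code units
    assert(s8.walkLength == 2);             // code points (auto-decoded)
    assert(s8.byGrapheme.walkLength == 1);  // graphemes ("characters")

    // Same character, different code point sequence, until normalized.
    assert(s8 != "\u00E9");
    assert(s8.normalize!NFC == "\u00E9");
}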

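And for the "convert to UTF-16 only when you have to" approach on Windows, a rough sketch of what that tends to look like (assuming the druntime bindings in core.sys.windows.windows and std.utf.toUTF16z; the showMessage wrapper name is made up for illustration):

version (Windows)
{
    import core.sys.windows.windows; // MessageBoxW, MB_OK
    import std.utf : toUTF16z;

    // Keep strings as UTF-8 (string) throughout the program and convert
    // to a null-terminated UTF-16 copy only where a "W" API needs it.
    void showMessage(string text, string caption)
    {
        MessageBoxW(null, text.toUTF16z, caption.toUTF16z, MB_OK);
    }
}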