On Thursday, 30 November 2017 at 10:19:18 UTC, Walter Bright wrote:
> On 11/27/2017 7:01 PM, A Guy With an Opinion wrote:
>> +- Unicode support is good. Although I think D's string type
>> should have probably been utf16 by default. Especially
>> considering the utf module states:
>> "UTF character support is restricted to '\u0000' <= character
>> <= '\U0010FFFF'."
>> Seems like the natural fit for me. Plus for the vast majority
>> of use cases I am pretty guaranteed a char = codepoint. Not
>> the biggest issue in the world and maybe I'm just being overly
>> critical here.
> Sooner or later your code will exhibit bugs if it assumes that
> char==codepoint with UTF16, because of surrogate pairs.
>
> https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
>
> As far as I can tell, pretty much the only users of UTF16 are
> Windows programs. Everyone else uses UTF8 or UTF32.
>
> I recommend using UTF8.
As long as you understand its limitations, I think most bugs
can be avoided. Where UTF16 breaks down is pretty well defined,
and it's also super rare; a quick sketch of the surrogate-pair
case is below.
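
To make the surrogate-pair point concrete, here's a minimal
sketch (my own example, using std.range.walkLength; D's ranges
auto-decode narrow strings to dchar):

import std.stdio;
import std.range : walkLength;

void main()
{
    // U+1F600 is outside the BMP, so UTF-16 encodes it as a
    // surrogate pair: two wchar code units, one code point.
    wstring s = "\U0001F600";
    writeln(s.length);      // 2 -- wchar code units
    writeln(s.walkLength);  // 1 -- decoded code points
}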

I think UTF32 would be great too, but it seems like just a
waste of space 99% of the time. UTF8 isn't horrible; I'm not
going to refuse to use D just because it uses UTF8 (that would
be silly), especially when wstring also seems baked into the
language. However, it can complicate code, because outside of
ASCII you pretty much always have to assume character != code
point (see the second sketch below). I can see a reasonable
person arguing that forcing you to assume character != code
point is actually a good thing. And that is a valid opinion.
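
For comparison, the same mismatch shows up in UTF8; again just
an illustrative sketch with walkLength:

import std.stdio;
import std.range : walkLength;

void main()
{
    string s = "héllo";     // 'é' (U+00E9) is two UTF-8 code units
    writeln(s.length);      // 6 -- char (code unit) count
    writeln(s.walkLength);  // 5 -- code points after decoding
    // s[1] is not 'é'; it's the first byte of its encoding.
}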