On 12/1/2017 3:16 PM, H. S. Teoh wrote:
> This is not true in Asia, esp. where the CJK block is extensively used.
> A CJK block character is 3 bytes in UTF-8, meaning that string sizes are
> 150% of the UCS2 encoding.  If your code contains a lot of CJK text,
> that's a lot of bloat.
>
> But then again, in non-Latin locales you'd generally store your strings
> separately from the executable (usually in l10n files), so this may not be
> that big an issue. But the blanket statement "Most strings are in ASCII"
> is not correct.
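
The size arithmetic in the quoted text is easy to check. Below is a minimal sketch in Python; the helper name encoded_sizes and the sample strings are only illustrative. It encodes the same text as UTF-8 and as UTF-16 and compares the byte counts, assuming all characters are in the Basic Multilingual Plane, where UTF-16 and UCS-2 take the same number of bytes.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    utf16_bytes = len(text.encode("utf-16-le"))
    return utf8_bytes, utf16_bytes


if __name__ == "__main__":
    # Two CJK characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
    print(encoded_sizes("漢字"))   # (6, 4)
    # Five ASCII characters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
    print(encoded_sizes("kanji"))  # (5, 10)
```

So CJK text comes out at 150% of its UTF-16 size in UTF-8, while ASCII text comes out at 50%. Which encoding is smaller overall depends on the mix of text a program actually handles.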

Are you sure about that? I know that Asian languages will be longer in UTF-8. But how much of the data that programs handle is in those languages? The language of business, science, programming, aviation, and engineering is English.

Of course, D itself is agnostic about that. The compiler, for example, accepts strings, identifiers, and comments in Chinese in UTF-16 format.
