On Saturday, 2 December 2017 at 22:16:09 UTC, Joakim wrote:
On Friday, 1 December 2017 at 23:16:45 UTC, H. S. Teoh wrote:
On Fri, Dec 01, 2017 at 03:04:44PM -0800, Walter Bright via Digitalmars-d wrote:
On 11/30/2017 9:23 AM, Kagamin wrote:
> On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki cattermole wrote:
> > Be aware Microsoft is alone in thinking that UTF-16 was
> > awesome. Everybody else standardized on UTF-8 for Unicode.
>
> UCS2 was awesome. UTF-16 is used by Java, JavaScript,
> Objective-C, Swift, Dart and ms tech, which is 28% of tiobe
> index.

"was" :-) Those are pretty much pre-surrogate pair designs, or based
on them (Dart compiles to JavaScript, for example).

UCS2 has serious problems:

1. Most strings are in ASCII, meaning UCS2 doubles memory consumption. Strings in the executable file are twice the size.

This is not true in Asia, esp. where the CJK block is extensively used. A CJK block character is 3 bytes in UTF-8, meaning that string sizes are 150% of the UCS2 encoding. If your code contains a lot of CJK text, that's a lot of bloat.
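To make the two size claims above concrete, here is a minimal sketch that counts encoded bytes. The helper name `encoded_sizes` and the sample strings are made up for illustration; UTF-16 stands in for UCS2, since the two encode these BMP characters identically.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

ascii_text = "hello world"
cjk_text = "\u7d71\u4e00\u78bc"  # three characters from the CJK Unified Ideographs block

print(encoded_sizes(ascii_text))  # (11, 22): UTF-16 doubles ASCII text
print(encoded_sizes(cjk_text))    # (9, 6): UTF-8 is 150% of UTF-16 here
```

Real documents mix both kinds of characters (markup, digits, punctuation), so the actual ratio depends on the text.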

Yep, that's why five years back many of the major Chinese sites were still not using UTF-8:

http://xahlee.info/w/what_encoding_do_chinese_websites_use.html

Summary

Taiwan sites almost all use UTF-8. Very old ones still use BIG5.

Mainland China sites mostly still use GBK or GB2312, but a few newer ones use UTF-8.

Many top Japanese and Korean sites also use UTF-8, but some use EUC (Extended Unix Code) variants.

This probably means that UTF-8 might dominate in the future.

mmmh

That led that Chinese guy to also rant against UTF-8 a couple years ago:

http://xahlee.info/comp/unicode_utf8_encoding_propaganda.html

That is a rant reproaching a video for not providing reasons why UTF-8 is good, while itself not providing any reasons why UTF-8 is bad. I'm not denying that UTF-8 has issues, only pointing out that the ranter gives no useful information on what issues "Asians" actually encounter with it, besides legacy reasons (which are important but have no bearing on the technical quality of an encoding). Add to that that he advocates GB18030, which is quite inferior to UTF-8 except in legacy support (some advantages of UTF-8 that GB18030 does not possess: self-synchronization, algorithmic mapping of code points, error detection). If his only beef with UTF-8 is the size of CJK text, then he shouldn't argue for UTF-32, which spends 4 bytes on every code point, as he seems to do at the end.
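As an illustration of the synchronization property mentioned above, here is a minimal sketch, assuming well-formed UTF-8 input. The function name `next_boundary` is hypothetical. It relies only on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so a reader dropped at an arbitrary byte offset can find the next code point start without scanning from the beginning.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a UTF-8 continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# 'a', then U+7D71 (three bytes: E7 B5 B1), then 'b'
data = "a\u7d71b".encode("utf-8")

# Offset 2 lands in the middle of the three-byte sequence.
start = next_boundary(data, 2)
print(start)                        # 4
print(data[start:].decode("utf-8")) # b
```

Decoding `data[2:]` directly would raise `UnicodeDecodeError`, which is the error-detection side of the same design: a stray continuation byte is never mistaken for a valid start.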
