Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets

Roy Lyseng Mon, 29 Sep 2008 23:33:38 -0700

Another approach would be to create a database in either the UTF-8 or the UTF-16 character set. UTF-16 obviously provides better storage utilization with some Asian locales.

Technically speaking, UTF-8 and UTF-16 are different encodings of the same character set, so the internal impact of allowing both would be minimal (but still significant). And the conversion between the two is rather trivial.

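An added advantage of UTF-16 is that all characters in the Basic Multilingual Plane are a fixed two bytes (characters outside it take a four-byte surrogate pair), so it is easy to calculate the space needed for a character string given the number of characters.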
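
As a rough illustration of the storage trade-off and of the conversion between the two encodings, here is a minimal Python sketch. The helper names (encoded_sizes, utf8_to_utf16) and the sample strings are made up for this example and are not part of Drizzle or MySQL. Note that a character outside the Basic Multilingual Plane takes four bytes in UTF-16, so the fixed-width estimate only holds for BMP text.

```python
def encoded_sizes(text):
    """Return (characters, UTF-8 bytes, UTF-16 bytes) for text."""
    # "utf-16-le" is used so that no 2-byte byte-order mark is prepended.
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


def utf8_to_utf16(data):
    """Convert UTF-8 bytes to UTF-16 (little-endian) bytes via Unicode."""
    return data.decode("utf-8").encode("utf-16-le")


# Hypothetical sample strings: ASCII, Japanese (BMP), and one non-BMP ideograph.
samples = {
    "ascii": "drizzle",
    "japanese": "\u65e5\u672c\u8a9e",  # three kanji, 3 bytes each in UTF-8
    "non_bmp": "\U00020BB7",            # needs a surrogate pair in UTF-16
}

for name, text in samples.items():
    chars, utf8_bytes, utf16_bytes = encoded_sizes(text)
    print(name, chars, utf8_bytes, utf16_bytes)
```

For ASCII-heavy data UTF-8 is half the size of UTF-16, while for the Japanese sample UTF-16 is the smaller of the two.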


Thanks,
Roy

Yoshinori Matsunobu wrote:

Hi Jay, Toru, all,
Personally, I like the concept of a "pluggable character set and collation" much better than simply rejecting all local encodings (such as EUC-JP, CP932).
I agree that UTF-8 is widely used in many applications (not limited to web
applications), but there are some cases where local encodings are better, especially for text-oriented applications.
Example:
A couple of months ago I checked the data size of Wikipedia-Japan (UTF-8 based).
The size was 2700MB. When I converted it to a local encoding (EUC-JP), the size was 2013MB. In the Wikipedia case, UTF-8 is 34% larger than the local encoding. This is obviously very important for certain types of applications.
Not many people want to buy additional disks/servers to implement the same
functionality.
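
The arithmetic behind the 34% figure, and the per-character reason for it, can be sketched in Python as follows. The function name size_ratio and the sample strings are assumptions for illustration; only the 2700MB and 2013MB figures come from the measurement above.

```python
def size_ratio(text, local_encoding="euc_jp"):
    """Return (UTF-8 bytes, local-encoding bytes, UTF-8 / local ratio)."""
    utf8_bytes = len(text.encode("utf-8"))
    local_bytes = len(text.encode(local_encoding))
    return utf8_bytes, local_bytes, utf8_bytes / local_bytes


# Pure Japanese text: 3 bytes per character in UTF-8, 2 bytes in EUC-JP.
print(size_ratio("\u65e5\u672c\u8a9e"))

# Mixed text: ASCII is 1 byte in both encodings, which pulls the ratio down.
print(size_ratio("MySQL \u65e5\u672c\u8a9e"))

# The reported Wikipedia-Japan sizes in MB.
reported_utf8_mb, reported_eucjp_mb = 2700, 2013
overhead = reported_utf8_mb / reported_eucjp_mb - 1
print(f"UTF-8 overhead: {overhead:.0%}")
```

Pure kana and kanji text would be 50% larger in UTF-8, so an overall figure of about 34% is consistent with a corpus that mixes Japanese text with ASCII markup.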


Please also do not forget about collations, which sometimes need special
consideration.
Character sets and collations are currently tightly coupled within MySQL. This is not good because:
- Adding a character set or collation to MySQL currently requires MySQL source code modification, which is not acceptable in most cases.
- Supporting a lot of character sets and collations is not easy. For example, non-Japanese database engineers have difficulty supporting Japanese character sets.

So, I like the "pluggable character set and collation" concept. For example:
- UTF-8 as the default character set
- Exposing a pluggable interface for additional server-side (client-side is also possible) character sets and collations
  - External developers can create a character code conversion map (e.g. EUC-JP <-> Unicode)
  - External developers can write a collation map (e.g. utf8_jis_x_4061_1996)
- If the client encoding and the server (column) encoding are the same, character code conversion does not happen (same as current MySQL)
- (Optional) If the client encoding and the server (column) encoding differ from each other, character code conversion happens (same as current MySQL)
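
The shape of such a pluggable interface could be sketched roughly like this in Python. Everything here is hypothetical: the class and function names (CharsetPlugin, register_charset, register_collation, transfer) are invented for illustration, Python's built-in codecs stand in for externally supplied conversion maps, and the "codepoint" collation is only a placeholder for a real collation map.

```python
import codecs


class CharsetPlugin:
    """Hypothetical plugin: a name plus a bytes <-> Unicode conversion map."""

    def __init__(self, name, python_codec):
        self.name = name
        # Assumption: a Python codec stands in for a developer-supplied map.
        self._codec = codecs.lookup(python_codec)

    def to_unicode(self, data):
        return self._codec.decode(data)[0]

    def from_unicode(self, text):
        return self._codec.encode(text)[0]


CHARSETS = {}    # charset name -> CharsetPlugin
COLLATIONS = {}  # collation name -> sort-key function


def register_charset(plugin):
    CHARSETS[plugin.name] = plugin


def register_collation(name, sort_key):
    COLLATIONS[name] = sort_key


def transfer(data, client_charset, column_charset):
    """Move bytes from client to column, converting only when needed."""
    if client_charset == column_charset:
        return data  # same encoding on both sides: no conversion
    text = CHARSETS[client_charset].to_unicode(data)
    return CHARSETS[column_charset].from_unicode(text)


# UTF-8 is the default; EUC-JP is added as an external plugin.
register_charset(CharsetPlugin("utf8", "utf-8"))
register_charset(CharsetPlugin("eucjp", "euc_jp"))
# Placeholder collation: order strings by Unicode code point.
register_collation("codepoint", lambda s: [ord(c) for c in s])

payload = "\u65e5\u672c\u8a9e".encode("euc_jp")
print(transfer(payload, "eucjp", "eucjp") is payload)  # passed through untouched
print(len(transfer(payload, "eucjp", "utf8")))         # converted via Unicode
```

The point of the registry is that adding a character set or collation becomes a registration call rather than a change to the server source.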
Regards,
----
Yoshinori Matsunobu
Senior MySQL Consultant
Sun Microsystems

MySQL Consulting Services:
http://www-jp.mysql.com/consulting/
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, September 30, 2008 1:45 AM
To: drizzle-discuss; Yoshinori Matsunobu
Subject: Toru's thoughts on UTF8 and CJK charsets

Hi Yoshi, all!

Toru has outlined some thoughts about UTF8 and CJK charsets and
standardizing drizzle on UTF8 here:

http://torum.net/2008/09/utf8-over-cjk-drizzle/
We'd very much like to get people's input and reactions to these ideas.
Cheers,

Jay
_______________________________________________
Mailing list: https://launchpad.net/~drizzle-discuss
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~drizzle-discuss
More help   : https://help.launchpad.net/ListHelp

