Another approach would be to allow a database to be created in either the UTF-8 or the UTF-16 character set. UTF-16 uses storage more efficiently for some Asian locales.

Technically speaking, UTF-8 and UTF-16 are different encodings of the same character set, so the internal impact of allowing both would be small (though not negligible), and converting between the two is straightforward.
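
To illustrate how little is involved in converting between the two encodings, here is a minimal Python sketch. The function names are made up for this example and are not part of MySQL or Drizzle; it assumes little-endian UTF-16 without a byte order mark.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16 (little-endian, no BOM)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 (little-endian, no BOM) bytes as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


sample = "日本語".encode("utf-8")      # three kanji
as_utf16 = utf8_to_utf16(sample)

# The round trip is lossless because both encode the same code points.
assert utf16_to_utf8(as_utf16) == sample

print(len(sample), len(as_utf16))      # byte counts in UTF-8 and UTF-16
```

For this sample the UTF-8 form is 9 bytes and the UTF-16 form is 6 bytes, which is the storage difference mentioned above for some Asian text.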

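An added advantage of UTF-16 is that every character in the Basic Multilingual Plane (BMP) has a fixed size of two bytes, so for such text it is easy to calculate the storage needed for a string from its number of characters. (Characters outside the BMP take a four-byte surrogate pair.)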
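
One caveat is worth checking when sizing strings this way: the two-bytes-per-character rule holds only inside the Basic Multilingual Plane. The sketch below, with a helper name invented for this example, compares a BMP-only string with one that contains a character outside the BMP.

```python
def utf16_byte_length(text: str) -> int:
    """Number of bytes needed to store text as UTF-16 without a BOM."""
    return len(text.encode("utf-16-le"))


bmp_text = "日本語"                   # three characters, all inside the BMP
astral_text = "\U00020BB7野家"        # first character is U+20BB7, outside the BMP

# Both strings contain three characters ...
assert len(bmp_text) == 3
assert len(astral_text) == 3

# ... but only the BMP-only string obeys "2 bytes per character".
assert utf16_byte_length(bmp_text) == 2 * len(bmp_text)
assert utf16_byte_length(astral_text) == 8   # 4 (surrogate pair) + 2 + 2
```

So a column sized as 2 x N bytes is safe for N BMP characters, but would need up to 4 x N bytes if supplementary characters must be supported.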

Thanks,
Roy

Yoshinori Matsunobu wrote:
Hi Jay, Toru, all,

Personally, I like the concept of "pluggable character set and collation" much better than simply rejecting all local encodings (such as EUC-JP and CP932).
I agree that UTF-8 is widely used in many applications (not limited to web applications), but there are some cases where local encodings are better, especially for text-oriented applications.

Example:
A couple of months ago I checked the data size of Wikipedia-Japan (UTF-8 based). The size was 2700MB. When I converted it to a local encoding (EUC-JP), the size was 2013MB. In the Wikipedia case, UTF-8 is 34% larger than the local encoding. This is clearly very important for certain types of applications.
Not many people want to buy additional disks or servers to implement the same functionality.
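
The size difference can be reproduced on a small scale. The following Python sketch (helper names are invented for illustration) encodes a short Japanese string in both encodings and also re-checks the 34% figure from the two sizes quoted above.

```python
def encoded_sizes(text: str, encodings=("utf-8", "euc_jp")) -> dict:
    """Return the byte length of text in each of the given encodings."""
    return {enc: len(text.encode(enc)) for enc in encodings}


def overhead_percent(larger: float, smaller: float) -> float:
    """How much bigger `larger` is than `smaller`, in percent."""
    return (larger / smaller - 1) * 100


sizes = encoded_sizes("日本語のテキスト")   # eight kana/kanji characters
print(sizes)

# Figures from the message above: 2700MB in UTF-8, 2013MB in EUC-JP.
print(round(overhead_percent(2700, 2013)))
```

Pure kana and kanji text takes 3 bytes per character in UTF-8 and 2 bytes in EUC-JP, a 50% overhead. ASCII characters such as markup take one byte in both encodings, which is why a real corpus ends up with a smaller ratio.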


Please also do not forget about collations, which sometimes need special consideration.

Character sets and collations are currently tightly coupled within MySQL. This is not good because:
- Adding a character set or collation to MySQL currently requires modifying the MySQL source code, which is not acceptable in most cases.
- Supporting a lot of character sets and collations is not easy. For example, non-Japanese database engineers have difficulty supporting Japanese character sets.

So, I like the "pluggable character set and collation" concept. For example:
- UTF-8 as the default character set
- Exposing a pluggable interface for additional server-side (and possibly client-side) character sets and collations
  - External developers can create character code conversion maps (e.g. EUC-JP <-> Unicode)
  - External developers can write collation maps (e.g. utf8_jis_x_4061_1996)
- If the client encoding and the server (column) encoding are the same, character code conversion does not happen (same as current MySQL)
- (Optional) If the client encoding and the server (column) encoding differ, character code conversion happens (same as current MySQL)
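
As a rough illustration of the pluggable idea described above, here is a small Python sketch. Everything in it (the `Charset` type, the registries, `register_charset`, `register_collation`, `convert`, `sort_values`, and the charset and collation names) is hypothetical and is not an existing MySQL or Drizzle interface. It only shows the shape: a charset plugin supplies a conversion to and from Unicode, a collation plugin supplies a sort key, and conversion is skipped when client and column encodings match.

```python
from typing import Callable, Dict, List, NamedTuple


class Charset(NamedTuple):
    """A hypothetical charset plugin: a name plus a conversion map to/from Unicode."""
    name: str
    to_unicode: Callable[[bytes], str]
    from_unicode: Callable[[str], bytes]


_CHARSETS: Dict[str, Charset] = {}
_COLLATIONS: Dict[str, Callable[[str], object]] = {}


def register_charset(charset: Charset) -> None:
    _CHARSETS[charset.name] = charset


def register_collation(name: str, sort_key: Callable[[str], object]) -> None:
    _COLLATIONS[name] = sort_key


def convert(data: bytes, client: str, column: str) -> bytes:
    """Convert bytes from the client encoding to the column encoding."""
    if client == column:
        return data                      # same encoding: no conversion at all
    text = _CHARSETS[client].to_unicode(data)
    return _CHARSETS[column].from_unicode(text)


def sort_values(values: List[str], collation: str) -> List[str]:
    return sorted(values, key=_COLLATIONS[collation])


# Built-in default character set.
register_charset(Charset("utf8", lambda b: b.decode("utf-8"),
                         lambda s: s.encode("utf-8")))
# "External" plugins, registered without touching the code above.
register_charset(Charset("eucjp", lambda b: b.decode("euc_jp"),
                         lambda s: s.encode("euc_jp")))
register_collation("case_insensitive", str.casefold)

raw = "日本語".encode("euc_jp")
assert convert(raw, "eucjp", "eucjp") is raw
assert convert(raw, "eucjp", "utf8") == "日本語".encode("utf-8")
assert sort_values(["b", "A", "c"], "case_insensitive") == ["A", "b", "c"]
```

A real server-side interface would need more than this (for example, handling of invalid byte sequences and collation weights rather than a simple key function), so this is only a sketch of the registration and dispatch idea.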

Regards,
----
Yoshinori Matsunobu
Senior MySQL Consultant
Sun Microsystems

MySQL Consulting Services:
http://www-jp.mysql.com/consulting/


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 30, 2008 1:45 AM
To: drizzle-discuss; Yoshinori Matsunobu
Subject: Toru's thoughts on UTF8 and CJK charsets

Hi Yoshi, all!

Toru has outlined some thoughts about UTF8 and CJK charsets and
standardizing drizzle on UTF8 here:

http://torum.net/2008/09/utf8-over-cjk-drizzle/

We'd very much like to get people's input and reactions to these ideas.

Cheers,

Jay


_______________________________________________
Mailing list: https://launchpad.net/~drizzle-discuss
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~drizzle-discuss
More help   : https://help.launchpad.net/ListHelp
