Another approach would be to create a database in either the UTF-8 or
the UTF-16 character set. UTF-16 provides better storage utilization
for some Asian locales, where most characters take three bytes in UTF-8
but two in UTF-16.
Technically speaking, UTF-8 and UTF-16 are different encodings of the
same character set, so the internal impact of allowing both would be
small (though not negligible). And the conversion between the two is
rather trivial.
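An added advantage of UTF-16 is that all characters in the Basic
Multilingual Plane are a fixed two bytes, so it is easy to calculate the
space needed for a character string given the number of characters.
(Supplementary characters are the exception: they take four bytes as a
surrogate pair.)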
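The size and conversion claims above are easy to check. A minimal sketch
in Python (the helper names encoded_sizes and utf8_to_utf16 and the
sample strings are illustrative assumptions, not anything from Drizzle
or MySQL):

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def utf8_to_utf16(data):
    """Convert UTF-8 bytes to UTF-16 (little-endian, no BOM) bytes."""
    return data.decode("utf-8").encode("utf-16-le")


japanese = "日本語のテキスト"      # 8 characters, all in the BMP
ascii_text = "plain text"          # 10 ASCII characters
supplementary = "\U00020BB7"       # one character outside the BMP

print(encoded_sizes(japanese))       # {'utf-8': 24, 'utf-16': 16}
print(encoded_sizes(ascii_text))     # {'utf-8': 10, 'utf-16': 20}
print(encoded_sizes(supplementary))  # {'utf-8': 4, 'utf-16': 4}
```

On these samples the Japanese string is smaller in UTF-16, the ASCII
string is smaller in UTF-8, and the supplementary character takes four
bytes in both encodings.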
Thanks,
Roy
Yoshinori Matsunobu wrote:
Hi Jay, Toru, all,
Personally, I like the concept of "Pluggable character set and collation"
much better than simply rejecting all local encodings (such as EUC-JP and
CP932).
I agree that UTF-8 is widely used in many applications (not limited to web
applications),
but there are some cases where local encodings are better, especially for
text-oriented applications.
Example:
A couple of months ago I checked the data size of Wikipedia-Japan (UTF-8
based).
The size was 2700MB. When I converted it to a local encoding (EUC-JP),
the size was 2013MB.
In the Wikipedia case, UTF-8 is 34% larger than the local encoding.
This is clearly very important for certain types of applications.
Not many people want to buy additional disks/servers to implement the same
functionality.
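The 34% figure follows directly from the two sizes, and the per-character
difference can be reproduced on a small sample. A rough sketch in Python
(the helper name overhead_percent and the sample strings are assumptions
for illustration; they are not taken from the Wikipedia dump):

```python
def overhead_percent(larger, smaller):
    """Return how much bigger `larger` is than `smaller`, in percent."""
    return (larger - smaller) / smaller * 100


# Sizes quoted above, in MB.
print(round(overhead_percent(2700, 2013)))  # 34

# Kana and kanji take 3 bytes in UTF-8 and 2 bytes in EUC-JP;
# ASCII (for example wiki markup) takes 1 byte in both.
kana_only = "ウィキペディア"       # 7 katakana characters
with_markup = "[[ウィキペディア]]"  # the same text plus 4 ASCII characters

for sample in (kana_only, with_markup):
    utf8_len = len(sample.encode("utf-8"))
    eucjp_len = len(sample.encode("euc_jp"))
    print(utf8_len, eucjp_len)
# prints "21 14" and then "25 18"
```

Pure kana/kanji text would be 50% larger in UTF-8. ASCII content takes the
same space in both encodings, which is one plausible reason a mixed corpus
ends up below that.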
Please also do not forget about collations, which sometimes need special
consideration.
Character sets and collations are currently tightly coupled within MySQL.
This is not good because:
- Adding a character set or collation to MySQL currently requires
modifying the MySQL source code,
which is not acceptable in most cases.
- Supporting a lot of character sets and collations is not easy.
For example, non-Japanese database engineers have difficulty supporting
Japanese character sets.
So, I like the "pluggable character set and collation" concept. For example:
- UTF-8 as a default character set
- Exposing a pluggable interface for additional server-side (and possibly
client-side) character sets and collations
- External developers can create a character code conversion map (e.g. EUC-JP
<-> Unicode)
- External developers can write a collation map (e.g. utf8_jis_x_4061_1996)
- If the client encoding and the server (column) encoding are the same,
character code conversion does not happen (same as current MySQL)
- (Optional) If the client encoding and the server (column) encoding differ
from each other,
character code conversion happens (same as current MySQL)
Regards,
----
Yoshinori Matsunobu
Senior MySQL Consultant
Sun Microsystems
MySQL Consulting Services:
http://www-jp.mysql.com/consulting/
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 30, 2008 1:45 AM
To: drizzle-discuss; Yoshinori Matsunobu
Subject: Toru's thoughts on UTF8 and CJK charsets
Hi Yoshi, all!
Toru has outlined some thoughts about UTF8 and CJK charsets and
standardizing drizzle on UTF8 here:
http://torum.net/2008/09/utf8-over-cjk-drizzle/
We'd very much like to get people's input and reactions to
these ideas.
Cheers,
Jay
_______________________________________________
Mailing list: https://launchpad.net/~drizzle-discuss
Post to : [email protected]
Unsubscribe : https://launchpad.net/~drizzle-discuss
More help : https://help.launchpad.net/ListHelp