Actually, UTF-16 is not so space-efficient even for East Asian users, because ASCII characters take 2 bytes each. A-Z, 0-9, whitespace, etc. are used frequently.
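
A minimal sketch of this effect, assuming Python 3. The sample strings and the helper name encoded_sizes are made up for illustration and are not taken from any real data set:

```python
# Hypothetical sample strings, chosen only to illustrate the point.
samples = {
    "ascii": "SELECT id FROM t1 WHERE a = 10",
    "japanese": "日本語のテキスト",
    "mixed": "MySQL 5.1 の新機能 (2008)",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name:8s} chars={len(text):3d} utf8={utf8:3d} utf16={utf16:3d}")
```

ASCII characters cost 1 byte in UTF-8 and 2 bytes in UTF-16, while BMP kana and kanji cost 3 bytes in UTF-8 and 2 bytes in UTF-16, so the winner depends on the mix of the two in the actual data.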
In the Japanese Wikipedia case, the size was 2674MB in UTF-16
(2700MB in UTF-8, 2013MB in the local encoding).

Having two columns (UTF8_text, UTF16_text), and inserting ASCII-mostly
values into UTF8_text and multibyte-mostly values into UTF16_text,
would alleviate the size penalty, but this seems too tricky from an
application perspective.

Regards,
----
Yoshinori Matsunobu
Senior MySQL Consultant
Sun Microsystems
MySQL Consulting Services: http://www-jp.mysql.com/consulting/

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 01, 2008 2:12 AM
> To: Jay Pipes
> Cc: Yoshinori Matsunobu; 'drizzle-discuss'; Bernt M. Johnsen
> Subject: Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets
>
> For a japanese application, UTF-8 demands 34% increased storage capacity
> compared to CJK, but UTF-16 should have the same requirement...
>
> Cheers,
> Roy
>
> Jay Pipes wrote:
> > Sure, understood, but it doesn't assuage Yoshi's concern about 34%
> > increase in storage requirements versus native CJK charsets...
> >
> > -j
> >
> > Roy Lyseng wrote:
> >> I think that I would rather have a single character set inside the
> >> server, but possibly allow client character set to be a different one.
> >> Thus, an application that must be compliant with CJK can still interface
> >> with the database.
> >>
> >> Does that make sense?
> >>
> >> Roy
> >>
> >> Jay Pipes wrote:
> >>> Roy Lyseng wrote:
> >>>> Jay Pipes wrote:
> >>>>> Yoshi, I fully agree with you on decoupling the collation and the
> >>>>> charset. That work will be done at some point.
> >>>>>
> >>>>> Regarding pluggable character sets, the idea is certainly in-line with
> >>>>> the idea of Drizzle being pluggable, modular and extensible, so I don't
> >>>>> really see any conflict from a "vision" perspective. That said, I think
> >>>>> at this point the benefits we see in simplification of the code base
> >>>>> through limiting to UTF8 charset is demonstrable. I think it makes
> >>>>> sense to proceed with our current direction (of having only UTF8 and
> >>>>> multiple collations) and then add pluggable charsets back into server
> >>>>> core at a later point when the plugin API is refactored.
> >>>>>
> >>>>> To do that:
> >>>>>
> >>>>> a) The CHARSET_INFO struct must be refactored to remove the
> >>>>> MY_COLLATION_HANDLER pointer.
> >>>>>
> >>>>> b) The MY_CHARSET_HANDLER struct should be refactored into either a
> >>>>> class which inherits from a base Plugin class or should be turned into a
> >>>>> type of plugin handler under the existing st_plugin with a load of
> >>>>> function pointer members stuff
> >>>>>
> >>>>> Right now, we can do a) fairly easily (maybe 1 week of work for a
> >>>>> developer), but b) is not so easy until we make a concerted effort to
> >>>>> make the plugin API easier to extend and to work with, IMHO.
> >>>>>
> >>>>> Regardless, your idea is a good one.
> >>>>>
> >>>>> Bernt and Roy,
> >>>>>
> >>>>> I assume if we did the above, that would satisfy your points about
> >>>>> UTF16 and 32?
> >>>> Slight difference: Because UTF-8/16/32 are equivalent and
> >>>> interchangeable, you could reconfigure (probably before creating the
> >>>> initial database) and still have the same internal functionality. If you
> >>>> allow pluggable character sets, you must address multiple simultaneous
> >>>> character sets, character set conversions, introducers, you name it...
> >>> Hmm, good points...perhaps the best way to approach this initially is to
> >>> make the collations pluggable and then, if the desire is there, add in
> >>> pluggable charsets at a later point. Either that, or limit the multiple
> >>> charset operations. For instance, don't allow introducers but do allow
> >>> the client to do SET NAMES. Don't allow CONVERT(charset1 TO charset2)
> >>> but do allow indexes to be stored in a specific charset, etc.
> >>>
> >>> The simplicity we've reached from narrowing to only support UTF8 is
> >>> mainly manifested in reduction of the parser and if adding pluggable
> >>> charsets back into the server increases the complexity of the parser
> >>> again, it's going to be a tough sell, particularly to Brian (and me and
> >>> others..)
> >>>
> >>> Cheers, and thanks for the input!
> >>>
> >>> Jay
> >>>
> >>>>> Cheers,
> >>>>>
> >>>>> Jay
> >>>>>
> >>>>> Bernt M. Johnsen wrote:
> >>>>>>>>>>>>>>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
> >>>>>>> Another approach would be to create a database in either UTF-8 or
> >>>>>>> UTF-16 character set. UTF-16 obviously provides a better storage
> >>>>>>> utilization with some Asian locales.
> >>>>>>>
> >>>>>>> Technically speaking UTF-8 and UTF-16 are different encodings of
> >>>>>>> the same character set, so the internal impact of allowing both
> >>>>>>> would be minimal (but still significant). And the conversion
> >>>>>>> between the two is rather trivial.
> >>>>>>>
> >>>>>>> An added advantage of UTF-16 is that all characters are fixed size,
> >>>>>>> so it is easy to calculate space of character string given the
> >>>>>>> number of characters.
> >>>>>> Nitpicking: Not quite, some characters will be represented by
> >>>>>> surrogate pairs so it's not that easy to calculate space after all if
> >>>>>> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
> >>>>>> assigned "CJK Unified Ideographs Extension B" in SIP (Supplementary
> >>>>>> Ideographic Plane) in the range 0x20000-0x2a6df and 0x2a700-0x2fa1f.
> >>>>>>
> >>>>>> But as long as we stick to BMP (Basic Multilingual Plane) Roy's
> >>>>>> assumption will hold.
> >>>>>>
> >>>>>> And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
> >>>>>> UTF-32 too.
> >>>
> >>> _______________________________________________
> >>> Mailing list: https://launchpad.net/~drizzle-discuss
> >>> Post to     : [email protected]
> >>> Unsubscribe : https://launchpad.net/~drizzle-discuss
> >>> More help   : https://help.launchpad.net/ListHelp
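
To illustrate Bernt's point in the quoted thread about surrogate pairs, here is a minimal sketch, assuming Python 3. The helper name utf16_code_units is hypothetical, and the two code points are picked only as examples: one inside the BMP and one at the start of the Extension B range he cites.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u65e5"      # U+65E5, inside the Basic Multilingual Plane
sip_char = "\U00020000"  # U+20000, start of the cited Extension B range

for ch in (bmp_char, sip_char):
    units = utf16_code_units(ch)
    utf8_bytes = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {units} UTF-16 code unit(s), {utf8_bytes} UTF-8 byte(s)")
```

A code point outside the BMP needs two 16-bit code units (a surrogate pair), so the byte length of a UTF-16 string equals twice the character count only when every character is in the BMP.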

