Actually, UTF-16 is not that space-efficient even for East Asian users,
because ASCII characters take 2 bytes each.
A-Z, 0-9, white space, etc. are used frequently.

In the Japanese Wikipedia case, the size was 2674MB in UTF-16
(2700MB in UTF-8, 2013MB in the local encoding).
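
As a rough illustration of why the numbers come out this way, the
following Python sketch compares byte counts per encoding. The sample
strings are made up for illustration (they are not taken from the
Wikipedia data), and EUC-JP is only assumed here as a stand-in for the
"local encoding", which is not named above.

```python
# Illustrative samples only; these are assumptions, not Wikipedia data.
SAMPLES = {
    "ascii_only": "SELECT id, title FROM page WHERE id = 42",
    "japanese_only": "日本語の文章です",
    "mixed_markup": '<a href="/wiki/Tokyo">東京</a> 2008',
}

# EUC-JP is an assumed stand-in for the unspecified local encoding.
ENCODINGS = ("utf-8", "utf-16-le", "euc_jp")


def encoded_sizes(text):
    """Return the number of bytes needed to store text in each encoding."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


for label, text in SAMPLES.items():
    print(label, encoded_sizes(text))

# Ratios of the sizes quoted above (in MB), relative to the local encoding.
print(round(2700 / 2013, 2))  # UTF-8  vs local: 1.34
print(round(2674 / 2013, 2))  # UTF-16 vs local: 1.33
```

ASCII-only text doubles in size in UTF-16, while kana and kanji shrink
from 3 bytes to 2, so markup-heavy text can end up nearly the same size
in either Unicode encoding.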

Having two columns (UTF8_text, UTF16_text),
inserting mostly-ASCII values into UTF8_text and
mostly-multibyte values into UTF16_text,
would alleviate the size penalty, but this seems too tricky from an
application perspective.
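
A minimal sketch of what the application would have to do, assuming it
picks the column per value by comparing encoded sizes. The function name
pick_column is hypothetical; only the column names come from the idea
above.

```python
def pick_column(value):
    """Return (column_name, encoded_bytes) for the smaller encoding.

    Hypothetical helper: ties go to UTF8_text.
    """
    as_utf8 = value.encode("utf-8")
    as_utf16 = value.encode("utf-16-le")
    if len(as_utf16) < len(as_utf8):
        return "UTF16_text", as_utf16
    return "UTF8_text", as_utf8


for value in ("hello world", "日本語の文章", "日本語 abc"):
    column, data = pick_column(value)
    print(column, len(data))
```

Every BMP kana or kanji character saves one byte in UTF-16 and every
ASCII character costs one, so the choice flips with the mix of each
value. Every read would also have to check both columns, which is the
tricky part.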

Regards,
----
Yoshinori Matsunobu
Senior MySQL Consultant
Sun Microsystems

MySQL Consulting Services:
http://www-jp.mysql.com/consulting/ 

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, October 01, 2008 2:12 AM
> To: Jay Pipes
> Cc: Yoshinori Matsunobu; 'drizzle-discuss'; Bernt M. Johnsen
> Subject: Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets
> 
> For a Japanese application, UTF-8 demands 34% increased storage capacity
> compared to CJK, but UTF-16 should have the same requirement...
> 
> Cheers,
> Roy
> 
> Jay Pipes wrote:
> > Sure, understood, but it doesn't assuage Yoshi's concern about 34%
> > increase in storage requirements versus native CJK charsets...
> > 
> > -j
> > 
> > Roy Lyseng wrote:
> >> I think that I would rather have a single character set inside the
> >> server, but possibly allow client character set to be a different one.
> >> Thus, an application that must be compliant with CJK can still interface
> >> with the database.
> >>
> >> Does that make sense?
> >>
> >> Roy
> >>
> >> Jay Pipes wrote:
> >>> Roy Lyseng wrote:
> >>>> Jay Pipes wrote:
> >>>>> Yoshi, I fully agree with you on decoupling the collation and the
> >>>>> charset.  That work will be done at some point.
> >>>>>
> >>>>> Regarding pluggable character sets, the idea is certainly in line with
> >>>>> the idea of Drizzle being pluggable, modular and extensible, so I don't
> >>>>> really see any conflict from a "vision" perspective.  That said, I think
> >>>>> at this point the benefits we see in simplification of the code base
> >>>>> through limiting to UTF8 charset are demonstrable.  I think it makes
> >>>>> sense to proceed with our current direction (of having only UTF8 and
> >>>>> multiple collations) and then add pluggable charsets back into server
> >>>>> core at a later point when the plugin API is refactored.
> >>>>>
> >>>>> To do that:
> >>>>>
> >>>>> a) The CHARSET_INFO struct must be refactored to remove the
> >>>>> MY_COLLATION_HANDLER pointer.
> >>>>>
> >>>>> b) The MY_CHARSET_HANDLER struct should be refactored either into a
> >>>>> class which inherits from a base Plugin class, or into a type of
> >>>>> plugin handler under the existing st_plugin with a load of function
> >>>>> pointer members.
> >>>>>
> >>>>> Right now, we can do a) fairly easily (maybe 1 week of work for a
> >>>>> developer), but b) is not so easy until we make a concerted effort to
> >>>>> make the plugin API easier to extend and to work with, IMHO.
> >>>>>
> >>>>> Regardless, your idea is a good one.
> >>>>>
> >>>>> Bernt and Roy,
> >>>>>
> >>>>> I assume if we did the above, that would satisfy your points about
> >>>>> UTF16 and 32?
> >>>> Slight difference: Because UTF-8/16/32 are equivalent and
> >>>> interchangeable, you could reconfigure (probably before creating the
> >>>> initial database) and still have the same internal functionality. If you
> >>>> allow pluggable character sets, you must address multiple simultaneous
> >>>> character sets, character set conversions, introducers, you name it...
> >>> Hmm, good points...perhaps the best way to approach this initially is to
> >>> make the collations pluggable and then, if the desire is there, add in
> >>> pluggable charsets at a later point.  Either that, or limit the multiple
> >>> charset operations.  For instance, don't allow introducers but do allow
> >>> the client to do SET NAMES.  Don't allow CONVERT(charset1 TO charset2)
> >>> but do allow indexes to be stored in a specific charset, etc.
> >>>
> >>> The simplicity we've reached from narrowing to only support UTF8 is
> >>> mainly manifested in reduction of the parser, and if adding pluggable
> >>> charsets back into the server increases the complexity of the parser
> >>> again, it's going to be a tough sell, particularly to Brian (and me and
> >>> others..)
> >>>
> >>> Cheers, and thanks for the input!
> >>>
> >>> Jay
> >>>
> >>>>> Cheers,
> >>>>>
> >>>>> Jay
> >>>>>
> >>>>> Bernt M. Johnsen wrote:
> >>>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
> >>>>>>> Another approach would be to create a database in either UTF-8 or
> >>>>>>> UTF-16 character set. UTF-16 obviously provides a better storage
> >>>>>>> utilization with some Asian locales.
> >>>>>>>
> >>>>>>> Technically speaking UTF-8 and UTF-16 are different encodings of
> >>>>>>> the same character set, so the internal impact of allowing both
> >>>>>>> would be minimal (but still significant). And the conversion
> >>>>>>> between the two is rather trivial.
> >>>>>>>
> >>>>>>> An added advantage of UTF-16 is that all characters are fixed size,
> >>>>>>> so it is easy to calculate space of character string given the
> >>>>>>> number of characters.
> >>>>>> Nitpicking: Not quite, some characters will be represented by
> >>>>>> surrogate pairs so it's not that easy to calculate space after all if
> >>>>>> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
> >>>>>> assigned "CJK Unified Ideographs Extension B" in SIP (Supplementary
> >>>>>> Ideographic Plane) in the range 0x20000-0x2a6df and 0x2a700-0x2fa1f.
> >>>>>>
> >>>>>> But as long as we stick to BMP (Basic Multilingual Plane) Roy's
> >>>>>> assumption will hold.
> >>>>>>
> >>>>>> And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
> >>>>>> UTF-32 too.
> >>>
> >>> _______________________________________________
> >>> Mailing list: https://launchpad.net/~drizzle-discuss
> >>> Post to     : [email protected]
> >>> Unsubscribe : https://launchpad.net/~drizzle-discuss
> >>> More help   : https://help.launchpad.net/ListHelp


