Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets

Jay Pipes Tue, 30 Sep 2008 07:43:10 -0700

Roy Lyseng wrote:
> Jay Pipes wrote:
>> Yoshi, I fully agree with you on decoupling the collation and the
>> charset.  That work will be done at some point.
>>
>> Regarding pluggable character sets, the idea is certainly in line with
>> the idea of Drizzle being pluggable, modular and extensible, so I don't
>> really see any conflict from a "vision" perspective.  That said, I think
>> at this point the benefits we see in simplification of the code base
>> through limiting to the UTF8 charset are demonstrable.  I think it makes
>> sense to proceed with our current direction (of having only UTF8 and
>> multiple collations) and then add pluggable charsets back into server
>> core at a later point when the plugin API is refactored.
>>
>> To do that:
>>
>> a) The CHARSET_INFO struct must be refactored to remove the
>> MY_COLLATION_HANDLER pointer.
>>
>> b) The MY_CHARSET_HANDLER struct should be refactored into either a
>> class which inherits from a base Plugin class or a type of plugin
>> handler under the existing st_plugin, with a set of function pointer
>> members.
>>
>> Right now, we can do a) fairly easily (maybe 1 week of work for a
>> developer), but b) is not so easy until we make a concerted effort to
>> make the plugin API easier to extend and to work with, IMHO.
>>
>> Regardless, your idea is a good one.
>>
>> Bernt and Roy,
>>
>> I assume if we did the above, that would satisfy your points about UTF16
>> and 32?
> 
> Slight difference: Because UTF-8/16/32 are equivalent and
> interchangeable, you could reconfigure (probably before creating the
> initial database) and still have the same internal functionality. If you
> allow pluggable character sets, you must address multiple simultaneous
> character sets, character set conversions, introducers, you name it...


Hmm, good points...perhaps the best way to approach this initially is to
make the collations pluggable and then, if the desire is there, add in
pluggable charsets at a later point.  Either that, or limit the multiple
charset operations.  For instance, don't allow introducers but do allow
the client to do SET NAMES.  Don't allow CONVERT(charset1 TO charset2)
but do allow indexes to be stored in a specific charset, etc.
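
As a rough illustration of the first option (one fixed charset, with
collations as the pluggable part), here is a minimal Python sketch.  The
registry and the names register_collation, sort_with, utf8_bin and
utf8_casefold are made up for illustration; this is not Drizzle's plugin
API, just the shape of the idea.

```python
# Hypothetical sketch: the charset is fixed (UTF-8), only collations plug in.
from typing import Callable, Dict, List

# A collation is modelled as a function from a string to a sort key.
Collation = Callable[[str], object]

_collations: Dict[str, Collation] = {}


def register_collation(name: str, key_func: Collation) -> None:
    """Register a collation under a unique name (assumed plugin entry point)."""
    if name in _collations:
        raise ValueError(f"collation {name!r} already registered")
    _collations[name] = key_func


def sort_with(name: str, values: List[str]) -> List[str]:
    """Sort values using the named collation's sort key."""
    return sorted(values, key=_collations[name])


# Binary collation: compare the raw UTF-8 bytes.
register_collation("utf8_bin", lambda s: s.encode("utf-8"))
# Case-insensitive collation: compare case-folded text.
register_collation("utf8_casefold", lambda s: s.casefold())

print(sort_with("utf8_bin", ["b", "a", "B", "A"]))
print(sort_with("utf8_casefold", ["b", "a", "B", "A"]))
```

Nothing in the sketch needs to know about a second charset, which is the
point: the parser and storage stay UTF8-only while collations vary.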

The simplicity we've reached from narrowing to only support UTF8 is
mainly manifested in a smaller parser.  If adding pluggable charsets
back into the server increases the complexity of the parser again, it's
going to be a tough sell, particularly to Brian (and me and others...)

Cheers, and thanks for the input!

Jay

>>
>> Cheers,
>>
>> Jay
>>
>> Bernt M. Johnsen wrote:
>>> Roy Lyseng wrote (2008-09-30 08:33:16):
>>>> Another approach would be to create a database in either UTF-8 or
>>>> UTF-16 character set. UTF-16 obviously provides a better storage
>>>> utilization with some Asian locales.
>>>>
>>>> Technically speaking UTF-8 and UTF-16 are different encodings of
>>>> the same character set, so the internal impact of allowing both
>>>> would be minimal (but still significant). And the conversion
>>>> between the two is rather trivial.
>>>>
>>>> An added advantage of UTF-16 is that all characters are fixed size,
>>>> so it is easy to calculate space of character string given the
>>>> number of characters.
>>> Nitpicking: Not quite, some characters will be represented by
>>> surrogate pairs so it's not that easy to calculate space after all if
>>> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
>>> assigned "CJK Unified Ideographs Extension B" in SIP (Supplemental
>>> Ideographic Plane) in the range 0x20000-0x2a6df and 0x2a700-0x2fa1f.
>>>
>>> But as long as we stick to BMP (Basic Multilingual Plane) Roy's
>>> assumption will hold.
>>>
>>> And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
>>> UTF-32 too.
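
To make the surrogate pair point above concrete, here is a small Python
sketch that measures the encoded size of one BMP ideograph and one
ideograph from the Supplemental Ideographic Plane.  The helper name
encoded_sizes is only illustrative; the "-le" codec variants are used so
that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


bmp_char = "\u65e5"      # U+65E5, a CJK ideograph inside the BMP
sip_char = "\U00020000"  # U+20000, first code point of CJK Extension B

print(encoded_sizes(bmp_char))  # {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
print(encoded_sizes(sip_char))  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

Inside the BMP, UTF-16 is two bytes per character and beats UTF-8 for
CJK text; outside it, a character takes a surrogate pair (four bytes),
so only UTF-32 is truly fixed width.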


