Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets

Jay Pipes Tue, 30 Sep 2008 07:57:53 -0700

Sure, understood, but it doesn't assuage Yoshi's concern about a 34%
increase in storage requirements versus native CJK charsets...


-j

Roy Lyseng wrote:
> I think that I would rather have a single character set inside the
> server, but possibly allow client character set to be a different one.
> Thus, an application that must be compliant with CJK can still interface
> with the database.
> 
> Does that make sense?
> 
> Roy
> 
> Jay Pipes wrote:
>> Roy Lyseng wrote:
>>> Jay Pipes wrote:
>>>> Yoshi, I fully agree with you on decoupling the collation and the
>>>> charset.  That work will be done at some point.
>>>>
>>>> Regarding pluggable character sets, the idea is certainly in line with
>>>> the idea of Drizzle being pluggable, modular and extensible, so I don't
>>>> really see any conflict from a "vision" perspective.  That said, I
>>>> think at this point the benefits we see in simplification of the code
>>>> base through limiting to the UTF8 charset are demonstrable.  I think it
>>>> makes sense to proceed with our current direction (of having only UTF8
>>>> and multiple collations) and then add pluggable charsets back into
>>>> server core at a later point when the plugin API is refactored.
>>>>
>>>> To do that:
>>>>
>>>> a) The CHARSET_INFO struct must be refactored to remove the
>>>> MY_COLLATION_HANDLER pointer.
>>>>
>>>> b) The MY_CHARSET_HANDLER struct should be refactored either into a
>>>> class which inherits from a base Plugin class, or into a type of
>>>> plugin handler under the existing st_plugin, with its load of
>>>> function pointer members.
>>>>
>>>> Right now, we can do a) fairly easily (maybe 1 week of work for a
>>>> developer), but b) is not so easy until we make a concerted effort to
>>>> make the plugin API easier to extend and to work with, IMHO.
>>>>
>>>> Regardless, your idea is a good one.
>>>>
>>>> Bernt and Roy,
>>>>
>>>> I assume if we did the above, that would satisfy your points about
>>>> UTF16 and 32?
>>> Slight difference: Because UTF-8/16/32 are equivalent and
>>> interchangeable, you could reconfigure (probably before creating the
>>> initial database) and still have the same internal functionality. If you
>>> allow pluggable character sets, you must address multiple simultaneous
>>> character sets, character set conversions, introducers, you name it...
>>
>> Hmm, good points...perhaps the best way to approach this initially is to
>> make the collations pluggable and then, if the desire is there, add in
>> pluggable charsets at a later point.  Either that, or limit the multiple
>> charset operations.  For instance, don't allow introducers but do allow
>> the client to do SET NAMES.  Don't allow CONVERT(charset1 TO charset2)
>> but do allow indexes to be stored in a specific charset, etc.
>>
>> The simplicity we've reached from narrowing to only support UTF8 is
>> mainly manifested in a reduction of the parser, and if adding pluggable
>> charsets back into the server increases the complexity of the parser
>> again, it's going to be a tough sell, particularly to Brian (and me and
>> others..)
>>
>> Cheers, and thanks for the input!
>>
>> Jay
>>
>>>> Cheers,
>>>>
>>>> Jay
>>>>
>>>> Bernt M. Johnsen wrote:
>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
>>>>>> Another approach would be to create a database in either UTF-8 or
>>>>>> UTF-16 character set. UTF-16 obviously provides a better storage
>>>>>> utilization with some Asian locales.
>>>>>>
>>>>>> Technically speaking UTF-8 and UTF-16 are different encodings of
>>>>>> the same character set, so the internal impact of allowing both
>>>>>> would be minimal (but still significant). And the conversion
>>>>>> between the two is rather trivial.
>>>>>>
>>>>>> An added advantage of UTF-16 is that all characters are fixed size,
>>>>>> so it is easy to calculate space of character string given the
>>>>>> number of characters.
>>>>> Nitpicking: Not quite, some characters will be represented by
>>>>> surrogate pairs so it's not that easy to calculate space after all if
>>>>> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
>>>>> assigned "CJK Unified Ideographs Extension B" in SIP (Supplemental
>>>>> Ideographic Plane) in the range 0x20000-0x2a6df and 0x2a700-0x2fa1f.
>>>>>
>>>>> But as long as we stick to the BMP (Basic Multilingual Plane) Roy's
>>>>> assumption will hold.
>>>>>
>>>>> And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
>>>>> UTF-32 too.
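
The surrogate-pair point above can be checked with a small sketch. The helper
names (utf16_code_units, needs_surrogates) are made up for illustration, and
the two sample characters are assumptions: one ideograph inside the BMP and
the first code point of the Extension B range mentioned above.

```python
# Show that UTF-16 is fixed-width only for characters inside the BMP.

def utf16_code_units(text):
    """Number of 16-bit code units needed to store `text` in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def needs_surrogates(ch):
    """True if a single character lies outside the BMP (above U+FFFF)."""
    return ord(ch) > 0xFFFF


bmp_char = "\u65e5"        # U+65E5, an ideograph inside the BMP
sip_char = "\U00020000"    # U+20000, start of CJK Unified Ideographs Extension B

if __name__ == "__main__":
    for ch in (bmp_char, sip_char):
        print("U+%04X" % ord(ch),
              "surrogates needed:", needs_surrogates(ch),
              "UTF-16 units:", utf16_code_units(ch),
              "UTF-8 bytes:", len(ch.encode("utf-8")))
```

So a column sized as "number of characters times two bytes" is only safe if
the data is restricted to the BMP; a supplementary-plane character takes four
bytes in both UTF-16 and UTF-8.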
>>
>>
>> _______________________________________________
>> Mailing list: https://launchpad.net/~drizzle-discuss
>> Post to     : [email protected]
>> Unsubscribe : https://launchpad.net/~drizzle-discuss
>> More help   : https://help.launchpad.net/ListHelp

