Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets

Jay Pipes Tue, 30 Sep 2008 07:17:14 -0700

Yoshi, I fully agree with you on decoupling the collation and the
charset.  That work will be done at some point.

Regarding pluggable character sets, the idea is certainly in-line with
the idea of Drizzle being pluggable, modular and extensible, so I don't
really see any conflict from a "vision" perspective.  That said, I think
the benefits of simplifying the code base by limiting it to the UTF8
charset are demonstrable at this point.  I think it makes
sense to proceed with our current direction (of having only UTF8 and
multiple collations) and then add pluggable charsets back into server
core at a later point when the plugin API is refactored.

To do that:

a) The CHARSET_INFO struct must be refactored to remove the
MY_COLLATION_HANDLER pointer.

b) The MY_CHARSET_HANDLER struct should be refactored, either into a
class which inherits from a base Plugin class, or into a type of plugin
handler under the existing st_plugin, with a set of function pointer
members.

Right now, we can do a) fairly easily (maybe 1 week of work for a
developer), but b) is not so easy until we make a concerted effort to
make the plugin API easier to extend and to work with, IMHO.
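
To make a) and b) a bit more concrete, here is a rough C++ sketch of
what the split could look like.  Every name in it (CharsetHandler,
Collation, Utf8Charset, BinaryCollation, count_chars) is hypothetical
and does not exist in the tree; it only illustrates a charset interface
that carries no collation pointer, with ordering rules in a separate
type.  The UTF8 length check is deliberately simplified.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names throughout; none of these exist in the Drizzle tree.

// Encoding-level operations only: note there is no collation pointer here.
struct CharsetHandler {
    virtual ~CharsetHandler() = default;
    virtual const char* name() const = 0;
    // Byte length of the character starting at p, or 0 if the lead byte is
    // invalid or the sequence would run past end.
    virtual std::size_t char_length(const unsigned char* p,
                                    const unsigned char* end) const = 0;
};

// Ordering rules live in their own type, independent of the charset.
struct Collation {
    virtual ~Collation() = default;
    virtual const char* name() const = 0;
    // Negative, zero or positive, like strcmp.
    virtual int compare(const std::string& a, const std::string& b) const = 0;
};

struct Utf8Charset : CharsetHandler {
    const char* name() const override { return "utf8"; }
    // Simplified: looks at the lead byte only and does not validate
    // the continuation bytes.
    std::size_t char_length(const unsigned char* p,
                            const unsigned char* end) const override {
        if (p >= end) return 0;
        std::size_t n = 0;
        if (*p < 0x80) n = 1;
        else if (*p >= 0xC2 && *p <= 0xDF) n = 2;
        else if (*p >= 0xE0 && *p <= 0xEF) n = 3;
        else if (*p >= 0xF0 && *p <= 0xF4) n = 4;
        if (n == 0 || static_cast<std::size_t>(end - p) < n) return 0;
        return n;
    }
};

struct BinaryCollation : Collation {
    const char* name() const override { return "binary"; }
    // Plain byte-wise comparison.
    int compare(const std::string& a, const std::string& b) const override {
        return a.compare(b);
    }
};

// Counts characters using only the charset interface; stops at the first
// invalid or truncated sequence.
inline std::size_t count_chars(const CharsetHandler& cs, const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const unsigned char* end = p + s.size();
    std::size_t count = 0;
    while (p < end) {
        std::size_t n = cs.char_length(p, end);
        if (n == 0) break;
        p += n;
        ++count;
    }
    return count;
}
```

With something like this, a table could pair one charset with any of
several collations, and a later pluggable charset would only need to
supply its own CharsetHandler.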

Regardless, your idea is a good one.

Bernt and Roy,

I assume if we did the above, that would satisfy your points about UTF16
and 32?

Cheers,

Jay

Bernt M. Johnsen wrote:
>>>>>>>>>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
>> Another approach would be to create a database in either UTF-8 or UTF-16  
>> character set. UTF-16 obviously provides a better storage utilization  
>> with some Asian locales.
>>
>> Technically speaking UTF-8 and UTF-16 are different encodings of the  
>> same character set, so the internal impact of allowing both would be  
>> minimal (but still significant). And the conversion between the two is  
>> rather trivial.
>>
>> An added advantage of UTF-16 is that all characters are fixed size, so  
>> it is easy to calculate space of character string given the number of  
>> characters.
> 
> Nitpicking: Not quite, some characters will be represented by
> surrogate pairs so it's not that easy to calculate space after all if
> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
> assigned characters in the SIP (Supplementary Ideographic Plane):
> "CJK Unified Ideographs Extension B" in the range 0x20000-0x2a6df and
> "CJK Compatibility Ideographs Supplement" in 0x2f800-0x2fa1f.
> 
> But as long as we stick to BMP (Basic Multilingual Plane) Roy's
> assumption will hold.
> 
> And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
> UTF-32 too.
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~drizzle-discuss
> Post to     : [email protected]
> Unsubscribe : https://launchpad.net/~drizzle-discuss
> More help   : https://help.launchpad.net/ListHelp
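
For reference, the storage arithmetic behind Bernt's surrogate-pair
point can be sketched in a few lines of C++.  The function names are
made up for illustration, and the input is assumed to be valid Unicode
code points (no lone surrogates, nothing above 0x10FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers; input is assumed to be valid Unicode code points.

// UTF-16 code units for one code point: 1 inside the BMP,
// 2 (a surrogate pair) for 0x10000 and above.
inline std::size_t utf16_units(std::uint32_t cp) {
    return cp >= 0x10000 ? 2 : 1;
}

// Bytes needed to store the code points as UTF-16 (2 bytes per code unit).
inline std::size_t utf16_bytes(const std::vector<std::uint32_t>& cps) {
    std::size_t units = 0;
    for (std::uint32_t cp : cps) units += utf16_units(cp);
    return units * 2;
}

// Bytes needed to store the same code points as UTF-8.
inline std::size_t utf8_bytes(const std::vector<std::uint32_t>& cps) {
    std::size_t bytes = 0;
    for (std::uint32_t cp : cps) {
        if (cp < 0x80) bytes += 1;
        else if (cp < 0x800) bytes += 2;
        else if (cp < 0x10000) bytes += 3;
        else bytes += 4;
    }
    return bytes;
}
```

So a BMP ideograph such as 0x65e5 takes 2 bytes in UTF-16 against 3 in
UTF-8, while an Extension B character such as 0x20000 takes 4 bytes in
both, which is why the "characters times two" estimate only holds
inside the BMP.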
