Roy Lyseng wrote:
> Jay Pipes wrote:
>> Yoshi, I fully agree with you on decoupling the collation and the
>> charset. That work will be done at some point.
>>
>> Regarding pluggable character sets, the idea is certainly in-line with
>> the idea of Drizzle being pluggable, modular and extensible, so I don't
>> really see any conflict from a "vision" perspective. That said, I think
>> at this point the benefits we see in simplification of the code base
>> through limiting to UTF8 charset is demonstrable. I think it makes
>> sense to proceed with our current direction (of having only UTF8 and
>> multiple collations) and then add pluggable charsets back into server
>> core at a later point when the plugin API is refactored.
>>
>> To do that:
>>
>> a) The CHARSET_INFO struct must be refactored to remove the
>> MY_COLLATION_HANDLER pointer.
>>
>> b) The MY_CHARSET_HANDLER struct should be refactored into either a
>> class which inherits from a base Plugin class or should be turned into a
>> type of plugin handler under the existing st_plugin with a load of
>> function pointer members stuff
>>
>> Right now, we can do a) fairly easily (maybe 1 week of work for a
>> developer), but b) is not so easy until we make a concerted effort to
>> make the plugin API easier to extend and to work with, IMHO.
>>
>> Regardless, your idea is a good one.
>>
>> Bernt and Roy,
>>
>> I assume if we did the above, that would satisfy your points about UTF16
>> and 32?
>
> Slight difference: Because UTF-8/16/32 are equivalent and
> interchangeable, you could reconfigure (probably before creating the
> initial database) and still have the same internal functionality. If you
> allow pluggable character sets, you must address multiple simultaneous
> character sets, character set conversions, introducers, you name it...

Hmm, good points... perhaps the best way to approach this initially is to
make the collations pluggable and then, if the desire is there, add in
pluggable charsets at a later point.

Either that, or limit the multiple charset operations. For instance,
don't allow introducers but do allow the client to do SET NAMES. Don't
allow CONVERT(charset1 TO charset2) but do allow indexes to be stored in
a specific charset, etc.

The simplicity we've reached from narrowing to only support UTF8 is
mainly manifested in the reduction of the parser, and if adding pluggable
charsets back into the server increases the complexity of the parser
again, it's going to be a tough sell, particularly to Brian (and me and
others...).

Cheers, and thanks for the input!

Jay

>>
>> Cheers,
>>
>> Jay
>>
>> Bernt M. Johnsen wrote:
>>>>>>>>>>>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
>>>> Another approach would be to create a database in either UTF-8 or
>>>> UTF-16 character set. UTF-16 obviously provides a better storage
>>>> utilization with some Asian locales.
>>>>
>>>> Technically speaking UTF-8 and UTF-16 are different encodings of
>>>> the same character set, so the internal impact of allowing both
>>>> would be minimal (but still significant). And the conversion
>>>> between the two is rather trivial.
>>>>
>>>> An added advantage of UTF-16 is that all characters are fixed size,
>>>> so it is easy to calculate space of character string given the
>>>> number of characters.
>>> Nitpicking: Not quite, some characters will be represented by
>>> surrogate pairs so it's not that easy to calculate space after all if
>>> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
>>> assigned "CJK Unified Ideographs Extension B" in SIP (Supplemental
>>> Ideographic Plane) in the range 0x20000-0x2a6df and 0x2a700-0x2fa1f.
>>>
>>> But as long as we stick to BMP (Basic Multilingual Plane) Roy's
>>> assumption will hold.
>>>
>>> And of course I agree with Roy.
>>> Do support UTF-8, UTF-16 and maybe UTF-32 too.
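
To make the storage and surrogate-pair points from the thread concrete, here is a minimal Python sketch (not Drizzle code; the helper name encoded_sizes is made up for illustration). It compares the encoded size of one BMP ideograph with one code point from the Supplementary Ideographic Plane, assuming the "-le" codec variants so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the code point count and the byte length of `text`
    in UTF-8, UTF-16 and UTF-32 (little-endian, no byte order mark)."""
    return {
        "code_points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# U+4E2D is a CJK ideograph inside the BMP.
bmp_char = "\u4e2d"
# U+20000 is the first code point of CJK Unified Ideographs Extension B,
# which lies in the SIP and therefore needs a surrogate pair in UTF-16.
sip_char = "\U00020000"

for label, ch in (("BMP", bmp_char), ("SIP", sip_char)):
    print(label, encoded_sizes(ch))
```

For the BMP character, UTF-16 takes 2 bytes against 3 for UTF-8, which is the storage advantage Roy mentions for some Asian text. For the SIP character, UTF-16 takes 4 bytes (two 16-bit code units), so "bytes = 2 x characters" only holds while the data stays within the BMP, as Bernt points out. Only UTF-32 is fixed width for every code point.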

