Yoshi, I fully agree with you on decoupling the collation and the charset. That work will be done at some point.
Regarding pluggable character sets, the idea is certainly in line with the idea of Drizzle being pluggable, modular and extensible, so I don't really see any conflict from a "vision" perspective. That said, I think at this point the benefits we see in simplification of the code base through limiting to the UTF8 charset are demonstrable. I think it makes sense to proceed with our current direction (of having only UTF8 and multiple collations) and then add pluggable charsets back into the server core at a later point when the plugin API is refactored.

To do that:

a) The CHARSET_INFO struct must be refactored to remove the MY_COLLATION_HANDLER pointer.
b) The MY_CHARSET_HANDLER struct should be refactored into either a class which inherits from a base Plugin class, or a type of plugin handler under the existing st_plugin with a load of function pointer members.

Right now, we can do a) fairly easily (maybe 1 week of work for a developer), but b) is not so easy until we make a concerted effort to make the plugin API easier to extend and to work with, IMHO.

Regardless, your idea is a good one.

Bernt and Roy, I assume if we did the above, that would satisfy your points about UTF16 and 32?

Cheers,

Jay

Bernt M. Johnsen wrote:
>>>>>>>>>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
>> Another approach would be to create a database in either UTF-8 or UTF-16
>> character set. UTF-16 obviously provides a better storage utilization
>> with some Asian locales.
>>
>> Technically speaking UTF-8 and UTF-16 are different encodings of the
>> same character set, so the internal impact of allowing both would be
>> minimal (but still significant). And the conversion between the two is
>> rather trivial.
>>
>> An added advantage of UTF-16 is that all characters are fixed size, so
>> it is easy to calculate space of character string given the number of
>> characters.
>
> Nitpicking: Not quite, some characters will be represented by
> surrogate pairs so it's not that easy to calculate space after all if
> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
> assigned "CJK Unified Ideographs Extension B" in SIP (Supplemental
> Ideographic Plane) in the range 0x20000-0x2a6df and 0x2a700-0x2fa1f.
>
> But as long as we stick to BMP (Basic Multilingual Plane) Roy's
> assumption will hold.
>
> And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
> UTF-32 too.
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Mailing list: https://launchpad.net/~drizzle-discuss
> Post to : [email protected]
> Unsubscribe : https://launchpad.net/~drizzle-discuss
> More help : https://help.launchpad.net/ListHelp
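
To make the storage arithmetic from the thread concrete, here is a minimal sketch in C++ (C++11 or later). The function names utf16_units, utf8_bytes and utf16_storage_bytes are hypothetical and not part of Drizzle; the sketch only assumes the standard UTF-8 and UTF-16 encoding rules. It shows that a BMP ideograph takes 2 bytes in UTF-16 against 3 in UTF-8, while a character from the Supplementary Ideographic Plane needs a surrogate pair, so UTF-16 is only fixed-width as long as the data stays within the BMP.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of 16-bit code units UTF-16 needs for one code point.
// Anything above U+FFFF (outside the BMP) is stored as a surrogate pair.
constexpr std::size_t utf16_units(char32_t cp) {
    return cp <= 0xFFFF ? 1 : 2;
}

// Number of bytes UTF-8 needs for one code point.
constexpr std::size_t utf8_bytes(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to store a string of code points as UTF-16.
// The assert rejects values that are not encodable characters:
// surrogate code points and anything beyond U+10FFFF.
std::size_t utf16_storage_bytes(const std::u32string& text) {
    std::size_t units = 0;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        units += utf16_units(cp);
    }
    return units * 2;  // each UTF-16 code unit is 2 bytes
}

// U+4E2D is a BMP ideograph: one UTF-16 unit, three UTF-8 bytes.
static_assert(utf16_units(0x4E2D) == 1, "BMP code point is one unit");
static_assert(utf8_bytes(0x4E2D) == 3, "BMP ideograph is 3 bytes in UTF-8");

// U+20000 is the first code point of CJK Extension B: surrogate pair.
static_assert(utf16_units(0x20000) == 2, "SIP code point is a pair");
static_assert(utf8_bytes(0x20000) == 4, "SIP code point is 4 bytes in UTF-8");
```

With these helpers, a two-character string holding U+4E2D followed by U+20000 occupies 6 bytes in UTF-16 rather than the 4 that a "2 bytes per character" estimate would predict.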

