Sure, understood, but it doesn't assuage Yoshi's concern about a 34% increase in storage requirements versus native CJK charsets...
-j

Roy Lyseng wrote:
> I think that I would rather have a single character set inside the
> server, but possibly allow the client character set to be a different one.
> Thus, an application that must be compliant with CJK can still interface
> with the database.
>
> Does that make sense?
>
> Roy
>
> Jay Pipes wrote:
>> Roy Lyseng wrote:
>>> Jay Pipes wrote:
>>>> Yoshi, I fully agree with you on decoupling the collation and the
>>>> charset. That work will be done at some point.
>>>>
>>>> Regarding pluggable character sets, the idea is certainly in line with
>>>> the idea of Drizzle being pluggable, modular and extensible, so I don't
>>>> really see any conflict from a "vision" perspective. That said, I think
>>>> at this point the benefits we see in simplification of the code base
>>>> through limiting to the UTF8 charset are demonstrable. I think it makes
>>>> sense to proceed with our current direction (of having only UTF8 and
>>>> multiple collations) and then add pluggable charsets back into server
>>>> core at a later point when the plugin API is refactored.
>>>>
>>>> To do that:
>>>>
>>>> a) The CHARSET_INFO struct must be refactored to remove the
>>>> MY_COLLATION_HANDLER pointer.
>>>>
>>>> b) The MY_CHARSET_HANDLER struct should be refactored into either a
>>>> class which inherits from a base Plugin class or should be turned into a
>>>> type of plugin handler under the existing st_plugin, with a load of
>>>> function pointer members.
>>>>
>>>> Right now, we can do a) fairly easily (maybe 1 week of work for a
>>>> developer), but b) is not so easy until we make a concerted effort to
>>>> make the plugin API easier to extend and to work with, IMHO.
>>>>
>>>> Regardless, your idea is a good one.
>>>>
>>>> Bernt and Roy,
>>>>
>>>> I assume if we did the above, that would satisfy your points about
>>>> UTF16 and 32?
>>>
>>> Slight difference: Because UTF-8/16/32 are equivalent and
>>> interchangeable, you could reconfigure (probably before creating the
>>> initial database) and still have the same internal functionality. If you
>>> allow pluggable character sets, you must address multiple simultaneous
>>> character sets, character set conversions, introducers, you name it...
>>
>> Hmm, good points...perhaps the best way to approach this initially is to
>> make the collations pluggable and then, if the desire is there, add in
>> pluggable charsets at a later point. Either that, or limit the multiple
>> charset operations. For instance, don't allow introducers but do allow
>> the client to do SET NAMES. Don't allow CONVERT(charset1 TO charset2)
>> but do allow indexes to be stored in a specific charset, etc.
>>
>> The simplicity we've reached from narrowing to only support UTF8 is
>> mainly manifested in reduction of the parser, and if adding pluggable
>> charsets back into the server increases the complexity of the parser
>> again, it's going to be a tough sell, particularly to Brian (and me and
>> others..)
>>
>> Cheers, and thanks for the input!
>>
>> Jay
>>
>>>> Cheers,
>>>>
>>>> Jay
>>>>
>>>> Bernt M. Johnsen wrote:
>>>>>>>>>>>>>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
>>>>>> Another approach would be to create a database in either UTF-8 or
>>>>>> UTF-16 character set. UTF-16 obviously provides a better storage
>>>>>> utilization with some Asian locales.
>>>>>>
>>>>>> Technically speaking UTF-8 and UTF-16 are different encodings of
>>>>>> the same character set, so the internal impact of allowing both
>>>>>> would be minimal (but still significant). And the conversion
>>>>>> between the two is rather trivial.
>>>>>>
>>>>>> An added advantage of UTF-16 is that all characters are fixed size,
>>>>>> so it is easy to calculate space of character string given the
>>>>>> number of characters.
>>>>> Nitpicking: Not quite, some characters will be represented by
>>>>> surrogate pairs so it's not that easy to calculate space after all if
>>>>> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
>>>>> assigned "CJK Unified Ideographs Extension B" in SIP (Supplemental
>>>>> Ideographic Plane) in the range 0x20000-0x2a6df and 0x2a700-0x2fa1f.
>>>>>
>>>>> But as long as we stick to BMP (Basic Multilingual Plane) Roy's
>>>>> assumption will hold.
>>>>>
>>>>> And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
>>>>> UTF-32 too.

_______________________________________________
Mailing list: https://launchpad.net/~drizzle-discuss
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~drizzle-discuss
More help   : https://help.launchpad.net/ListHelp
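
For anyone who wants to check the size arithmetic discussed in this thread, here is a minimal sketch, assuming Python 3. The helper names (encoded_sizes, needs_surrogates) and the two sample strings are made up for illustration and do not come from the original posts or from the Drizzle code base. It compares the byte length of the same text in UTF-8, UTF-16 and UTF-32, and shows that a code point outside the BMP takes a surrogate pair in UTF-16, so UTF-16 is only fixed-width while the data stays inside the BMP.

```python
# Illustrative sketch only: helper names and sample strings are assumptions,
# not anything taken from the thread or from Drizzle.

def encoded_sizes(text: str) -> dict:
    """Return the code point count and the byte length of text in
    UTF-8, UTF-16 and UTF-32 (little-endian variants, so no BOM is added)."""
    return {
        "code_points": len(text),
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),
        "utf32": len(text.encode("utf-32-le")),
    }


def needs_surrogates(text: str) -> bool:
    """True if any code point lies outside the BMP (above U+FFFF),
    which means UTF-16 must encode it as a surrogate pair."""
    return any(ord(ch) > 0xFFFF for ch in text)


bmp_text = "日本語"       # three ideographs, all inside the BMP
sip_text = "\U00020000"  # first code point of CJK Unified Ideographs Extension B

for sample in (bmp_text, sip_text):
    print(encoded_sizes(sample), "surrogates:", needs_surrogates(sample))
```

For the BMP sample, UTF-8 needs 3 bytes per ideograph and UTF-16 needs 2; for the Extension B sample both need 4 bytes for a single character. Real data usually mixes ideographs with ASCII, so the overall ratio between encodings depends on the text being stored.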

