I think that I would rather have a single character set inside the
server, but possibly allow client character set to be a different one.
Thus, an application that must be compliant with CJK can still interface
with the database.
Does that make sense?
Roy
Jay Pipes wrote:
Roy Lyseng wrote:
Jay Pipes wrote:
Yoshi, I fully agree with you on decoupling the collation and the
charset. That work will be done at some point.
Regarding pluggable character sets, the idea is certainly in line with
the idea of Drizzle being pluggable, modular and extensible, so I don't
really see any conflict from a "vision" perspective. That said, I think
at this point the benefits we see in simplification of the code base
through limiting to the UTF8 charset are demonstrable. I think it makes
sense to proceed with our current direction (of having only UTF8 and
multiple collations) and then add pluggable charsets back into server
core at a later point when the plugin API is refactored.
To do that:
a) The CHARSET_INFO struct must be refactored to remove the
MY_COLLATION_HANDLER pointer.
b) The MY_CHARSET_HANDLER struct should be refactored into either a
class which inherits from a base Plugin class, or a type of plugin
handler under the existing st_plugin with a load of function pointer
members.
Right now, we can do a) fairly easily (maybe 1 week of work for a
developer), but b) is not so easy until we make a concerted effort to
make the plugin API easier to extend and to work with, IMHO.
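
To make the decoupling in a) and b) concrete, here is a rough sketch of the shape it could take. It is written in Python for brevity rather than in the server's own language, and every name in it (Charset, Collation, compare and the concrete subclasses) is hypothetical; none of this is Drizzle code. The point is only that a collation works on decoded characters and never needs a pointer back to the charset that produced them.

```python
from abc import ABC, abstractmethod


class Charset(ABC):
    """Hypothetical charset plugin: only knows how to map bytes <-> characters."""

    @abstractmethod
    def decode(self, data: bytes) -> str: ...

    @abstractmethod
    def encode(self, text: str) -> bytes: ...


class Utf8Charset(Charset):
    def decode(self, data: bytes) -> str:
        return data.decode("utf-8")

    def encode(self, text: str) -> bytes:
        return text.encode("utf-8")


class Utf16Charset(Charset):
    def decode(self, data: bytes) -> str:
        return data.decode("utf-16-le")

    def encode(self, text: str) -> bytes:
        return text.encode("utf-16-le")


class Collation(ABC):
    """Hypothetical collation plugin: orders characters, knows nothing of bytes."""

    @abstractmethod
    def sort_key(self, text: str): ...


class BinaryCollation(Collation):
    def sort_key(self, text: str):
        # Order by code point.
        return text


class CaseInsensitiveCollation(Collation):
    def sort_key(self, text: str):
        # Simplistic case folding; a real collation would be locale aware.
        return text.casefold()


def compare(charset: Charset, collation: Collation, a: bytes, b: bytes) -> int:
    """Return -1, 0 or 1, combining any charset with any collation."""
    ka = collation.sort_key(charset.decode(a))
    kb = collation.sort_key(charset.decode(b))
    return (ka > kb) - (ka < kb)
```

With this split, compare(Utf8Charset(), CaseInsensitiveCollation(), b"Drizzle", b"drizzle") treats the two values as equal, and the same collation object can be reused unchanged with Utf16Charset.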
Regardless, your idea is a good one.
Bernt and Roy,
I assume if we did the above, that would satisfy your points about UTF16
and 32?
Slight difference: Because UTF-8/16/32 are equivalent and
interchangeable, you could reconfigure (probably before creating the
initial database) and still have the same internal functionality. If you
allow pluggable character sets, you must address multiple simultaneous
character sets, character set conversions, introducers, you name it...
Hmm, good points...perhaps the best way to approach this initially is to
make the collations pluggable and then, if the desire is there, add in
pluggable charsets at a later point. Either that, or limit the multiple
charset operations. For instance, don't allow introducers but do allow
the client to do SET NAMES. Don't allow CONVERT(charset1 TO charset2)
but do allow indexes to be stored in a specific charset, etc.
The simplicity we've reached from narrowing to only support UTF8 is
mainly manifested in a reduction of the parser, and if adding pluggable
charsets back into the server increases the complexity of the parser
again, it's going to be a tough sell, particularly to Brian (and me and
others..)
Cheers, and thanks for the input!
Jay
Cheers,
Jay
Bernt M. Johnsen wrote:
Roy Lyseng wrote (2008-09-30 08:33:16):
Another approach would be to create a database in either UTF-8 or
UTF-16 character set. UTF-16 obviously provides a better storage
utilization with some Asian locales.
Technically speaking UTF-8 and UTF-16 are different encodings of
the same character set, so the internal impact of allowing both
would be minimal (but still significant). And the conversion
between the two is rather trivial.
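
As a small illustration of the storage and conversion points, the following Python sketch compares the encoded size of an ASCII string and a Japanese string in both encodings, and converts between them. The helper names (encoded_sizes, utf8_to_utf16) and the sample strings are made up for this example; the little-endian UTF-16 form is used only to keep a byte order mark out of the byte counts.

```python
ASCII_SAMPLE = "database"
# "database" in katakana, six characters, all in the BMP.
JAPANESE_SAMPLE = "\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9"


def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def utf8_to_utf16(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16; the character content is unchanged."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


for label, sample in (("ascii", ASCII_SAMPLE), ("japanese", JAPANESE_SAMPLE)):
    print(label, encoded_sizes(sample))
```

For these samples the ASCII string takes 8 bytes in UTF-8 and 16 in UTF-16, while the katakana string takes 18 bytes in UTF-8 and 12 in UTF-16, so which encoding is more compact depends on the data.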
An added advantage of UTF-16 is that all characters are fixed size,
so it is easy to calculate space of character string given the
number of characters.
Nitpicking: Not quite, some characters will be represented by
surrogate pairs so it's not that easy to calculate space after all if
you were to be strictly UTF-16 compliant. As of Unicode 5.0 there are
characters assigned in the SIP (Supplementary Ideographic Plane):
"CJK Unified Ideographs Extension B" in the range 0x20000-0x2a6df and
"CJK Compatibility Ideographs Supplement" in the range 0x2f800-0x2fa1f.
But as long as we stick to the BMP (Basic Multilingual Plane) Roy's
assumption will hold.
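
The surrogate pair point is easy to check. This Python sketch counts UTF-16 code units for one BMP ideograph and for U+20000, the first code point of the supplementary range mentioned above. The helper names (utf16_code_units, storage_bytes) are invented for the example.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def storage_bytes(text: str, encoding: str) -> int:
    """Byte length of text in the given encoding."""
    return len(text.encode(encoding))


BMP_CHAR = "\u4e2d"       # U+4E2D, inside the Basic Multilingual Plane
SIP_CHAR = "\U00020000"   # U+20000, outside the BMP

for ch in (BMP_CHAR, SIP_CHAR):
    print(
        f"U+{ord(ch):04X}",
        "utf-16 code units:", utf16_code_units(ch),
        "utf-8 bytes:", storage_bytes(ch, "utf-8"),
        "utf-32 bytes:", storage_bytes(ch, "utf-32-le"),
    )
```

The BMP character needs one code unit (2 bytes) in UTF-16, while U+20000 needs a surrogate pair (2 code units, 4 bytes), so the byte length of a UTF-16 string is character count times two only when every character is in the BMP. Of the three encodings, only UTF-32 is fixed width per code point.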
And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
UTF-32 too.
_______________________________________________
Mailing list: https://launchpad.net/~drizzle-discuss
Post to : [email protected]
Unsubscribe : https://launchpad.net/~drizzle-discuss
More help : https://help.launchpad.net/ListHelp