Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets

Jim Starkey Tue, 30 Sep 2008 09:28:09 -0700

Jay Pipes wrote:

Roy Lyseng wrote:

Jay Pipes wrote:

Yoshi, I fully agree with you on decoupling the collation and the
charset.  That work will be done at some point.


Regarding pluggable character sets, the idea is certainly in-line with
the idea of Drizzle being pluggable, modular and extensible, so I don't
really see any conflict from a "vision" perspective.  That said, I think
at this point the benefits we see in simplification of the code base
through limiting to UTF8 charset is demonstrable.  I think it makes
sense to proceed with our current direction (of having only UTF8 and
multiple collations) and then add pluggable charsets back into server
core at a later point when the plugin API is refactored.

To do that:

a) The CHARSET_INFO struct must be refactored to remove the
MY_COLLATION_HANDLER pointer.

b) The MY_CHARSET_HANDLER struct should be refactored into either a
class which inherits from a base Plugin class, or turned into a type of
plugin handler under the existing st_plugin, with its load of function
pointer members.

Right now, we can do a) fairly easily (maybe 1 week of work for a
developer), but b) is not so easy until we make a concerted effort to
make the plugin API easier to extend and to work with, IMHO.
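
As a rough illustration of what step a) is after, here is a minimal
Python sketch in which the charset object knows only how to encode and
decode, and the collation is a separate object passed alongside it. The
names Charset, Collation, and sort_rows are hypothetical and are not
taken from the Drizzle or MySQL source; the real structs are C and carry
many more members.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical names throughout; none of these come from the Drizzle source.


@dataclass(frozen=True)
class Charset:
    """Knows only how to turn bytes into text and back; no collation pointer."""
    name: str
    codec: str  # a Python codec name, e.g. "utf-8"

    def decode(self, raw: bytes) -> str:
        return raw.decode(self.codec)

    def encode(self, text: str) -> bytes:
        return text.encode(self.codec)


@dataclass(frozen=True)
class Collation:
    """Knows only how to order already-decoded text."""
    name: str
    sort_key: Callable[[str], object]


UTF8 = Charset("utf8", "utf-8")
BINARY = Collation("binary", lambda s: s)                     # code point order
CASE_INSENSITIVE = Collation("general_ci", lambda s: s.casefold())


def sort_rows(rows: List[bytes], charset: Charset, collation: Collation) -> List[bytes]:
    # The caller chooses the collation; the charset does not dictate one.
    return sorted(rows, key=lambda raw: collation.sort_key(charset.decode(raw)))


rows = [b"banana", b"Apple", b"apple", b"Banana"]
by_binary = sort_rows(rows, UTF8, BINARY)
by_ci = sort_rows(rows, UTF8, CASE_INSENSITIVE)
```

With one charset and several collations, only the second argument ever
varies, which is roughly the "only UTF8 and multiple collations"
direction described above.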

Regardless, your idea is a good one.

Bernt and Roy,

I assume if we did the above, that would satisfy your points about UTF16
and 32?

Slight difference: Because UTF-8/16/32 are equivalent and
interchangeable, you could reconfigure (probably before creating the
initial database) and still have the same internal functionality. If you
allow pluggable character sets, you must address multiple simultaneous
character sets, character set conversions, introducers, you name it...


Hmm, good points...perhaps the best way to approach this initially is to
make the collations pluggable and then, if the desire is there, add in
pluggable charsets at a later point.  Either that, or limit the multiple
charset operations.  For instance, don't allow introducers but do allow
the client to do SET NAMES.  Don't allow CONVERT(charset1 TO charset2)
but do allow indexes to be stored in a specific charset, etc.

The simplicity we've reached from narrowing to only support UTF8 is
mainly manifested in the reduction of the parser, and if adding pluggable
charsets back into the server increases the complexity of the parser
again, it's going to be a tough sell, particularly to Brian (and me and
others..)

A single internal character set means that strings needn't be typed by
character set, there needn't be any character set conversions, there
needn't be any runtime testing for character set, etc etc etc.
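
To make the contrast concrete, here is a small Python sketch of an
equality test in both worlds. TypedString, equal_multi_charset, and
equal_single_charset are hypothetical names invented for this example;
this is not Drizzle or MySQL code, and it ignores collation and
normalization entirely.

```python
from dataclasses import dataclass

# Hypothetical illustration; not Drizzle or MySQL code.


@dataclass(frozen=True)
class TypedString:
    """A value in a multi-charset server: raw bytes plus a charset tag."""
    raw: bytes
    charset: str  # a Python codec name, e.g. "latin-1" or "utf-8"


def equal_multi_charset(a: TypedString, b: TypedString) -> bool:
    # Runtime test for character set, then conversion to a common form.
    if a.charset == b.charset:
        return a.raw == b.raw
    return a.raw.decode(a.charset) == b.raw.decode(b.charset)


def equal_single_charset(a: bytes, b: bytes) -> bool:
    # Every value is known to be UTF-8: no tag, no test, no conversion.
    return a == b


latin1_e_acute = TypedString(b"\xe9", "latin-1")
utf8_e_acute = TypedString(b"\xc3\xa9", "utf-8")

mixed_equal = equal_multi_charset(latin1_e_acute, utf8_e_acute)
single_equal = equal_single_charset(b"\xc3\xa9", b"\xc3\xa9")
```

Both calls compare the same character, but only the first has to carry a
tag on every value and branch on it.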

UTF8, UTF16, and UTF32 are semantically equivalent. Some character sets
yield shorter strings when mapped to UTF-8; others yield shorter strings
when mapped to UTF-16. Any general claim that either is more efficient
or denser is false.
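
A quick way to check the size claim is to encode the same text in each
form and count bytes. The sketch below uses Python's built-in codecs;
the helper name encoded_sizes and the two sample strings are my own
choices, and the little-endian codec variants are used only so that no
byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in each Unicode encoding form."""
    return {
        codec: len(text.encode(codec))
        for codec in ("utf-8", "utf-16-le", "utf-32-le")
    }


# ASCII text: one byte per character in UTF-8, two in UTF-16.
ascii_sizes = encoded_sizes("hello")

# Three Japanese characters from the Basic Multilingual Plane:
# three bytes each in UTF-8, two bytes each in UTF-16.
japanese_sizes = encoded_sizes("日本語")

print(ascii_sizes)     # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(japanese_sizes)  # {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
```

UTF-8 is smaller for the first string and UTF-16 for the second, so
which form is denser depends on the text being stored.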

There is no utility in multiple character sets. Multiple character sets
don't give additional capabilities, there isn't a general savings in
space, and in any case are transparent to users. All they add is
complexity and overhead.

Ripping out all character set handling in favor of hardcoded UTF8 (or
UTF16 or UTF32) reduces code size and complexity, and adds to general
goodness. Note, however, that making character sets pluggable requires
full character set handling semantics whether pluggable character sets
are present or not (this is known as "architecture tax").


--
Jim Starkey
President, NimbusDB, Inc.
978 526-1376


_______________________________________________
Mailing list: https://launchpad.net/~drizzle-discuss
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~drizzle-discuss
More help   : https://help.launchpad.net/ListHelp
