Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets

Bernt M. Johnsen Tue, 30 Sep 2008 01:57:23 -0700

>>>>>>>>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
> Another approach would be to create a database in either UTF-8 or UTF-16  
> character set. UTF-16 obviously provides a better storage utilization  
> with some Asian locales.
>
> Technically speaking UTF-8 and UTF-16 are different encodings of the  
> same character set, so the internal impact of allowing both would be  
> minimal (but still significant). And the conversion between the two is  
> rather trivial.
>
> An added advantage of UTF-16 is that all characters are fixed size, so  
> it is easy to calculate space of character string given the number of  
> characters.


Nitpicking: Not quite. Some characters are represented by surrogate
pairs, so it's not that easy to calculate space after all if you want
to be strictly UTF-16 compliant. As of Unicode 5.0 there are assigned
characters in the SIP (Supplementary Ideographic Plane): "CJK Unified
Ideographs Extension B" in the range 0x20000-0x2a6df and "CJK
Compatibility Ideographs Supplement" in the range 0x2f800-0x2fa1f.

But as long as we stick to the BMP (Basic Multilingual Plane), Roy's
assumption will hold.
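
To make the point concrete, here is a small sketch in Python (my choice
for illustration only, nothing to do with the Drizzle code base). The
helper names encoded_sizes and utf16_code_units are made up for this
example. It compares the storage needed for one BMP ideograph and for
U+20000, the first code point of Extension B, using the BOM-less
little-endian codecs so that only the character data is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_code_units(text):
    """Return the number of 16-bit code units UTF-16 needs for text."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u4e2d"      # U+4E2D, a CJK ideograph inside the BMP
sip_char = "\U00020000"  # U+20000, first code point of Extension B (SIP)

print(encoded_sizes(bmp_char))     # {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
print(encoded_sizes(sip_char))     # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
print(utf16_code_units(sip_char))  # 2, i.e. one surrogate pair
```

So for BMP-only data the space is 2 bytes times the number of
characters, as Roy says, but a single SIP character already breaks that
rule. Of the three encodings, only UTF-32 is fixed width across all
planes.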

And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
UTF-32 too.

-- 
Bernt Marius Johnsen, Staff Engineer
Database Technology Group, Sun Microsystems, Trondheim, Norway


