Re: [Drizzle-discuss] Toru's thoughts on UTF8 and CJK charsets

Monty Taylor Tue, 30 Sep 2008 15:14:47 -0700

So, I keep meaning to show some sample code for something here, and I
keep having to do other things. (darn day job...)


So... quick and ugly (and very basic) sample code for using existing
system locale data:

#include <locale>
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>

using namespace std;

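// Returns true if s1 sorts before s2 under the zh_CN collation rules.
// Note: the locale constructor throws std::runtime_error if the
// zh_CN.UTF8 locale is not installed on the system.
bool localeLessThan(const string& s1, const string& s2) {

  // Build the locale once rather than on every comparison.
  static const locale locale1("zh_CN.UTF8");

  const collate<char>& col= use_facet<collate<char> >(locale1);

  const char* pb1= s1.data();
  const char* pb2= s2.data();

  return (col.compare(pb1, pb1 + s1.size(),
                      pb2, pb2 + s2.size()) < 0);
}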

int main(int argc, char** argv) {

  string s2 = "流金岁月";
  string s1 = "人物 王菲出 演同志电影";
  string s3 = "活跃用户";

  vector<string> all_the_strings;

  all_the_strings.push_back(s1);
  all_the_strings.push_back(s2);
  all_the_strings.push_back(s3);

  sort(all_the_strings.begin(), all_the_strings.end(), localeLessThan);
  for (vector<string>::const_iterator p= all_the_strings.begin();
       p != all_the_strings.end();
       ++p)
    cout << *p << endl;

}

$ g++ -o test_locale test_locale.cc
$ ./test_locale
活跃用户
流金岁月
人物 王菲出 演同志电影

So rather than build our own system for dealing with all of this, I'd
love to see us use some of what's already there. Unlike the C version,
the C++ one seems to understand that you might want to use more than
just one global locale. Now, I'm not sure how charsets enter into this
setup... but the ability is there to deal with collations, numbers,
currency and dates. Any thoughts?

Monty

Jay Pipes wrote:
> Yoshi, I fully agree with you on decoupling the collation and the
> charset.  That work will be done at some point.
> 
> Regarding pluggable character sets, the idea is certainly in line with
> the idea of Drizzle being pluggable, modular and extensible, so I don't
> really see any conflict from a "vision" perspective.  That said, I think
> at this point the benefits we see in simplification of the code base
> through limiting to the UTF8 charset are demonstrable.  I think it makes
> sense to proceed with our current direction (of having only UTF8 and
> multiple collations) and then add pluggable charsets back into server
> core at a later point when the plugin API is refactored.
> 
> To do that:
> 
> a) The CHARSET_INFO struct must be refactored to remove the
> MY_COLLATION_HANDLER pointer.
> 
> b) The MY_CHARSET_HANDLER struct should be refactored into either a
> class which inherits from a base Plugin class or should be turned into a
> type of plugin handler under the existing st_plugin, with a load of
> function pointer members.
> 
> Right now, we can do a) fairly easily (maybe 1 week of work for a
> developer), but b) is not so easy until we make a concerted effort to
> make the plugin API easier to extend and to work with, IMHO.
> 
> Regardless, your idea is a good one.
> 
> Bernt and Roy,
> 
> I assume if we did the above, that would satisfy your points about UTF16
> and 32?
> 
> Cheers,
> 
> Jay
> 
> Bernt M. Johnsen wrote:
>>>>>>>>>>>>>> Roy Lyseng wrote (2008-09-30 08:33:16):
>>> Another approach would be to create a database in either UTF-8 or UTF-16  
>>> character set. UTF-16 obviously provides a better storage utilization  
>>> with some Asian locales.
>>>
>>> Technically speaking UTF-8 and UTF-16 are different encodings of the  
>>> same character set, so the internal impact of allowing both would be  
>>> minimal (but still significant). And the conversion between the two is  
>>> rather trivial.
>>>
>>> An added advantage of UTF-16 is that all characters are fixed size, so  
>>> it is easy to calculate space of character string given the number of  
>>> characters.
>> Nitpicking: Not quite, some characters will be represented by
>> surrogate pairs so it's not that easy to calculate space after all if
>> you were to be strictly UTF-16 compliant. There are now (Unicode 5.0)
>> assigned "CJK Unified Ideographs Extension B" in SIP (Supplemental
>> Ideographic Plane) in the range 0x20000-0x2a6df and 0x2a700-0x2fa1f.
>>
>> But as long as we stick to BMP (Basic Multilingual Plane) Roy's
>> assumption will hold.
>>
>> And of course I agree with Roy. Do support UTF-8, UTF-16 and maybe
>> UTF-32 too.
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Mailing list: https://launchpad.net/~drizzle-discuss
>> Post to     : [email protected]
>> Unsubscribe : https://launchpad.net/~drizzle-discuss
>> More help   : https://help.launchpad.net/ListHelp
> 


