Dennis Björklund wrote:

> In the future we need indexes that depend on the locale (and a lot of other changes).


I agree. I've been looking at the web on this subject a lot lately. I am **NOT** a Microsoft fan, but SQL Server even lets a user define a language (maybe an encoding) down to the column level!


I've been reading up on GNU C and on languages, encodings, and localization.

http://pauillac.inria.fr/~lang/hotlist/free/licence/fsf96/drepper/paper-1.html
http://h21007.www2.hp.com/dspp/tech/tech_TechSingleTipDetailPage_IDX/1,2366,1222,00.html


There are three basic approaches to handling different languages in computerized text:


   A/ Various adaptations of the 8-bit character set, e.g. the ISO-8859-x series.
       One byte per character.
       Easy storing, small size for a string.
       Easy storing, if English characters, 100% efficient use of storage space.
       Easy processing between applications, works well in the stream model of *nix.
       Easy processing in applications, a byte is a character.
       Easy string handling, NO NULL bytes in a string, except at the end of the string.
       NOT easy to tell the encoding from the document itself.
       This is not the way of the future.
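
The "NOT easy to tell the encoding" point can be seen with a quick sketch. This is only an illustration, assuming Python 3 and its standard codecs; the helper name `decode_as` and the three sample bytes are made up for the example.

```python
def decode_as(raw, codec):
    """Interpret the same raw bytes under the given 8-bit codec."""
    return raw.decode(codec)

# Three bytes with the high bit set. Nothing in the bytes themselves
# says which ISO-8859 part they were written in.
raw = bytes([0xE1, 0xE2, 0xE3])

for codec in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    # Each codec yields a different, equally "valid" string.
    print(codec, decode_as(raw, codec))
```

The same three bytes come out as Latin, Cyrillic, or Greek letters depending on which codec the reader assumes, so the encoding has to be carried outside the data.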

   B/ Wide characters
       UCS-2, UTF-32, others. (Strictly, UTF-16 is variable width, 2 or 4 bytes per
       character because of surrogate pairs, and Shift-JIS is a 1-or-2-byte multibyte
       encoding, so both really belong under C.)
       Each character the same width, 2 or 4 bytes (2 bytes covers the Basic
       Multilingual Plane, which handles nearly all languages in current use).
       Not so easy storing, if English characters, 50% to 75% loss of storage space.
       Difficult processing between applications, does NOT work well in the stream model of *nix.
       Easy processing in applications, a set width of bits/bytes is a character.
       Difficult string handling, MANY NULL bytes in a string, especially if in English.
       Moderately easy to tell encoding/language in the document.
   ********This should be how PostgreSQL stores data internally.********
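
The storage and NULL-byte figures for wide characters can be checked with a rough sketch like the one below. It assumes Python 3; the helper `encoded_stats` is a hypothetical name, and the little-endian codec variants are used only so that no byte-order mark is counted.

```python
def encoded_stats(text, codec):
    """Return (total bytes, zero bytes) for text encoded with codec."""
    data = text.encode(codec)
    return len(data), data.count(0)

text = "PostgreSQL"  # 10 plain ASCII characters

# The "-le" codecs emit no byte-order mark, so only the text is counted.
for codec in ("iso-8859-1", "utf-16-le", "utf-32-le"):
    size, zeros = encoded_stats(text, codec)
    print(f"{codec:<10} {size:>2} bytes, {zeros:>2} of them zero")

# A character outside the 16-bit range takes two 16-bit units in UTF-16,
# so UTF-16 is not strictly fixed width.
print(len("\U0001F600".encode("utf-16-le")))
```

For English text, half of the 2-byte encoding and three quarters of the 4-byte encoding are zero bytes, which is where the 50% to 75% figure comes from and why C-style string functions that stop at the first zero byte do not work on such data.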

   C/ Multibyte characters
       UTF-8
       Variable width for different characters, 1 to 4 bytes (the original design allowed up to 6).
       Not so easy storing, if non-English characters, 50% to 80% loss of storage space
           (in reality, most common Western languages hover around 5-20% loss of storage space;
            most common non-Western languages hover around 40-60% loss of storage space).
       Easy processing between applications, works well in the stream model of *nix.
       Difficult processing in applications, a variable number of bytes is a character.
       Easy string handling, only ONE NULL byte in a string (the terminator).
       Moderately easy to tell encoding/language in the document.
   ********This is how PostgreSQL should default to sending data OUT of the application,
           i.e. to the display or the web, or other system applications.********
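
To make the variable width of UTF-8 concrete, here is a small sketch, again assuming Python 3. The helper `utf8_widths` and the sample words are invented for illustration; the non-ASCII words are written as escapes so the code does not depend on the encoding of this message.

```python
def utf8_widths(text):
    """Return the number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]

samples = {
    "English": "database",
    "Swedish": "Bj\u00f6rklund",                # o with diaeresis
    "Russian": "\u0431\u0430\u0437\u0430",      # four Cyrillic letters
    "Japanese": "\u65e5\u672c\u8a9e",           # three kanji
}

for name, word in samples.items():
    data = word.encode("utf-8")
    # No zero bytes appear inside the encoded text, whatever the language.
    print(f"{name:<9} {len(word)} chars -> {len(data)} bytes, "
          f"widths {utf8_widths(word)}, zero bytes: {data.count(0)}")
```

The Swedish name needs 10 bytes for 9 characters, while the Cyrillic and Japanese samples need 2 and 3 bytes per character. None of them contains a zero byte, which is why UTF-8 passes cleanly through byte-oriented *nix tools.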





