Re: [GENERAL] unicode and sorting(at least)

Dennis Gearon Thu, 24 Jun 2004 09:18:02 -0700

All of the ISO 8xxx encodings and LATINX encodings can handle two languages, English and at least one other. Sometimes they can handle several languages besides English, and are actually designed to handle a family of languages.

The ONLY encodings that can handle a significant number of languages and character 
sets are the ISO/UTF/UCS series (UCS is giving way to UTF). In fact, they can handle 
every human language ever used, plus some esoteric ones postulated, and there is 
room for future languages.
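
A minimal sketch of the coverage difference, assuming Python 3 and its bundled codecs. The sample words and the `can_encode` helper are illustrative and are not part of the original discussion:

```python
# Illustrative sample words, one per language (not from the original post).
samples = {
    "English": "sorting",
    "German": "Straße",
    "Greek": "Ελληνικά",
    "Russian": "Русский",
    "Japanese": "日本語",
}

def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

encodings = ["iso-8859-1", "iso-8859-7", "iso-8859-5", "utf-8"]

# For each encoding, list the sample languages it can represent.
coverage = {
    enc: [lang for lang, word in samples.items() if can_encode(word, enc)]
    for enc in encodings
}

for enc, langs in coverage.items():
    print(f"{enc:<12} {', '.join(langs)}")
```

Each 8-bit encoding covers English plus its own family, while UTF-8 covers all five samples.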

So, for a column to handle multiple languages/character sets, either the 
languages/character sets have to be in the family that the database's encoding was 
defined for (that is the situation in Postgres currently; choosing the encoding down 
to the column level is available on several other databases and is in the SQL spec), 
OR the encoding for the database has to be UTF8 (since we don't have UTF16 or UTF32 
available).

Right now, the SORTING algorithm and functionality are fixed for the database cluster, which can contain databases with any kind of encoding. It really does not do much good to have a locale that differs from the encoding, except with UTF8, which as an encoding is language/character set neutral, or with SQL_ASCII and an ISO8xxx or LatinX encoding. Since a running instance of Postgres can only be connected to one cluster, a database engine has FIXED sorting, no matter what language/character set encoding is chosen for the database.

It so happens that most non-UTF encodings are designed to sort well in an extended 
ASCII/8-bit environment, which is what the ISO8xxx and LatinX encodings actually are. 
I'm not sure that it's perfect, though. So, if SQL_ASCII is chosen for the LOCALE (in 
Postgres terms that is the C locale, which compares raw bytes; SQL_ASCII itself is an 
encoding name), and the encoding is ISO8xxx or LATINx, it will probably sort OK.

UTF8/16/32 is built the same way. However, this only applies per character, and only 
works painlessly on UTF32, which has fixed-width characters. UTF8/16, OTOH, have 
variable-length characters (in multiples of 8 and 16 bits, respectively). Since 
SQL_ASCII sorts in a binary fashion, UTF8/16 won't sort correctly under the SQL_ASCII 
locale, I believe.
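
The binary-sorting question can be checked directly. This is a minimal sketch, assuming Python 3, whose `sorted()` compares strings by code point. The sample strings are illustrative: one character above the surrogate range and one supplementary-plane character are included on purpose.

```python
# Sample strings: ASCII, Latin-1, a high BMP character (U+FF5A) and a
# supplementary-plane character (U+10400).
words = ["z", "é", "\uff5a", "\U00010400", "Z", "e"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Binary (bytewise) ordering of each encoded form, as a byte-comparing
# locale would see it.
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))
by_utf32 = sorted(words, key=lambda s: s.encode("utf-32-be"))

utf8_matches = by_utf8 == by_code_point
utf16_matches = by_utf16 == by_code_point
utf32_matches = by_utf32 == by_code_point

print("UTF-8  bytewise == code point order:", utf8_matches)
print("UTF-16 bytewise == code point order:", utf16_matches)
print("UTF-32 bytewise == code point order:", utf32_matches)
```

Bytewise comparison of UTF-8 and big-endian UTF-32 reproduces code point order, while UTF-16 does not, because surrogate pairs sort below U+E000 through U+FFFF. Code point order is still not dictionary order: "Z" sorts before "e", and "é" sorts after "z", so a language-aware collation is needed either way.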

Tatsuo Ishii wrote:

On Wed, 23 Jun 2004, Dennis Gearon wrote:

This is what eventually has to be done (as Sybase, and probably others, do it):
http://www.ianywhere.com/whitepapers/unicode.html


Actually, what probably has to be eventually done is what's in the SQL
spec.

Which is AFAICS basically:
Allow multiple encodings
Allow multiple character sets (within an encoding)

Could you please explain the above in more detail? In my understanding a
character set can have multiple encodings, but...
--
Tatsuo Ishii

Allow one or more collations per character set
Allow columns to specify character set and collation
Allow literals in multiple character sets
Allow translations and encoding conversions (as supported)
Allow explicit COLLATE clauses to control ordering and comparisons.
Handle identifiers in multiple character sets

plus some misc things like allowing sets that control the default
character set for literals for this session and such.
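
As a rough illustration of two items in the list above (a collation attached to a column, and an explicit COLLATE clause that overrides it), here is a minimal sketch using Python's standard `sqlite3` module. SQLite is only a stand-in here, not Postgres. The collation name `accent_insensitive` and the `fold_accents` helper are hypothetical, and the folding is a crude approximation rather than a real locale collation.

```python
import sqlite3
import unicodedata

def fold_accents(text):
    """Crude sort key: drop combining marks and case (NOT a real locale collation)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def accent_insensitive(a, b):
    """Comparison callback for SQLite: negative, zero, or positive."""
    ka, kb = fold_accents(a), fold_accents(b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
# Register the hypothetical collation before any SQL refers to it.
conn.create_collation("accent_insensitive", accent_insensitive)

# The column carries its own collation.
conn.execute("CREATE TABLE words (w TEXT COLLATE accent_insensitive)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("zebra",), ("Éclair",), ("apple",), ("eagle",)],
)

# ORDER BY uses the column's collation by default...
default_order = [row[0] for row in conn.execute("SELECT w FROM words ORDER BY w")]
# ...and an explicit COLLATE clause overrides it with bytewise comparison.
binary_order = [
    row[0] for row in conn.execute("SELECT w FROM words ORDER BY w COLLATE BINARY")
]

print(default_order)
print(binary_order)
conn.close()
```

The same data sorts differently depending on which collation applies, which is the per-column and per-query control the list describes.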


