Re: Unicode sorting and binary comparison, please!

Anders Karlsson Mon, 03 Mar 2008 01:27:46 -0800

Yves!

This is a complicated matter all right, but it is also a complicated problem to solve. Your statement about characters being the same isn't really correct. To take an example: let's assume you were producing a printed phonebook of all the people in the world. How would you sort it? Every name in the book should be printed correctly, the way it is usually printed in its country of origin. You would soon realize that such a phonebook just couldn't be made in a single edition.

Certain characters, although they can appear (at least as part of a name), are treated differently in different countries. Two examples: the Nordic "umlaut / ring" characters å, ä and ö (&aring;, &auml; and &ouml; in HTML lingo). These are sorted differently in the different countries where they are used. In Sweden they come last in the alphabet; in Germany they are usually, IIRC, intermixed with a and o respectively.

Another, and much better, example is the accented characters. In some languages accents are a very important part of the language, French probably being the best example here. Leaving an accent out changes things considerably, and the presence or absence of an accent changes the sort order. In Sweden accents exist, even in Swedish names, and they change the pronunciation of the word slightly (although you usually know what the intention is, even when the accent is left out). But collation-wise, in any type of listing (phonebooks etc.), the accented characters are treated as if the accent just wasn't there. The names Linden and Lindén are pronounced differently, but sorted together as if the accent weren't there at all.

To your specific problem then: the issue is that we can have just about every character in the world available in Unicode (this isn't true, really, but for this discussion let's assume it is). The important thing when you store data is that you allow all these characters to be stored, i.e. that the utf8 character set is used. The "collate" specification is just the default ordering for the column. As in the phonebook example above, this is how we would sort the characters in the phonebook; let's assume we use Swedish. The nice thing with MySQL is that you can then allow another sort order and/or comparison method, like being able to re-sort the phonebook for non-Swedish people.
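
As a rough sketch of the above (the table name phonebook_demo and the choice of utf8_general_ci as the "other" sort order are only assumptions for illustration), a column can carry a Swedish default collation and still be re-sorted per query:

```sql
-- Make sure the client talks utf8, so the literals below are read correctly.
SET NAMES utf8;

-- Hypothetical demo table: utf8 storage, Swedish ordering by default.
CREATE TABLE phonebook_demo (
  name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_swedish_ci
);

INSERT INTO phonebook_demo (name) VALUES ('Handel'), ('Händel'), ('Hendel');

-- Sorted with the column's default (Swedish) collation.
SELECT name FROM phonebook_demo ORDER BY name;

-- Same data, re-sorted for this one query with a different collation.
SELECT name FROM phonebook_demo ORDER BY name COLLATE utf8_general_ci;
```

With the Swedish collation I would expect Händel to come after Hendel, since ä sorts at the end of the alphabet there, while utf8_general_ci treats ä like a plain a.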

As for comparisons, the issue is the same. You don't know, assuming the phonebook problem above, whether someone looking for a person in the book is French, in which case accented characters should be properly compared, or Swedish, in which case they are to be ignored. The solution is to say what language you want, or that you want a binary comparison. If you want exact matching, where any character, accented / umlauted etc., is different from any other character, specify a binary comparison:

SELECT * FROM phonebook WHERE BINARY name = 'Handel';

Look into the character set casting / conversion functions in the MySQL manual: http://dev.mysql.com/doc/refman/5.0/en/cast-functions.html
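
Besides the BINARY operator, a COLLATE clause can be used to pick the comparison rules per expression. The following is only a small sketch with literals (no table needed, and the column aliases are made up), assuming the client sends utf8:

```sql
SET NAMES utf8;

SELECT
  -- Accent-insensitive: general_ci treats a and ä as equal.
  _utf8'Handel' = _utf8'Händel' COLLATE utf8_general_ci AS same_in_general_ci,
  -- Byte-wise comparison: the two strings differ.
  _utf8'Handel' = _utf8'Händel' COLLATE utf8_bin        AS same_in_bin,
  -- Byte-wise comparison is also case sensitive.
  _utf8'Handel' = _utf8'handel' COLLATE utf8_bin        AS same_case_in_bin;
```

The same idea should work against a column, e.g. WHERE name COLLATE utf8_bin = 'Handel', while an ORDER BY on that column keeps using the column's own collation. That gives language-aware sorting and exact matching in one query.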

Alternatively, you could specify the client collation, which would apply to all operations. Or you could create your own collation. I would really like more case-sensitive collations myself. Case sensitivity is also something that differs between characters in different languages.
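
A minimal sketch of setting the client (connection) collation, assuming a utf8 connection; utf8_bin is just one possible choice here:

```sql
-- Set the connection character set and collation for this session.
SET NAMES utf8 COLLATE utf8_bin;

-- Check what the session is using now.
SHOW VARIABLES LIKE 'collation_connection';

-- Literal-to-literal comparisons now follow the connection collation.
SELECT 'Handel' = 'Händel' AS same_string;
```

One caveat: when a column is compared with a string literal, the column's own collation normally takes precedence over the connection collation, so for column comparisons an explicit COLLATE or BINARY in the query is still needed.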


Hope this helps a bit
/Karlsson

Yves Goergen wrote:

Hello,
I've just read through the MySQL documentation about Unicode support, collations and how it affects sorting and comparison of strings. And I find it horrible, at least. I feel like I'm back in the MySQL 3.x days where I used UTF-8 in my application and MySQL treated it binary. The only problem was incorrect sorting of things. Today we have UTF-8 support in MySQL, which brings correct sorting (for whatever definition of "correct") but has taken correct comparison again.
When I have three strings, e.g. "Handel", "Händel" and "Hendel", I'd like to have them sorted correctly. Using the utf8_{general,unicode}_ci collation seems the only way. Now when I want the row with "Handel" in it, I'll get two rows back. One of them is not what I wanted. So strictly, the result is incorrect. The only way to get this right is using the utf8_bin collation. But this again makes correct sorting impossible.
It's a nightmare. Why can't I get correct sorting *and* correct (i.e. precise) comparison in one?
If I cannot even rely on the = operator, what good is a text-storing database? There even isn't a case-sensitive unicode collation other than utf8_bin. This means that in every database application that uses unicode, I cannot separate lower from uppercase when retrieving stuff. MySQL is simply blind for that. Not to mention different characters that Unicode, MySQL, DIN, ISO or whoever think are the same, but they aren't. If they were the same, you wouldn't need both of them.
Finally, my application should really be portable. I haven't looked into how other DBMS handle it and whether the SQL syntax would be the same, should there be any method on the language layer to do it right. I only know that SQLite stores in UTF-8 but otherwise doesn't care about Unicode, i.e. sorting should be broken, comparison is correct. PostgreSQL didn't find its own columns again, so I cancelled the test.



--
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /  Anders Karlsson ([EMAIL PROTECTED])
 / /|_/ / // /\ \/ /_/ / /__ MySQL AB, Sales Engineer
/_/  /_/\_, /___/\___\_\___/ Stockholm
       <___/   www.mysql.com Cellphone: +46 708 608121
                              Skype: drdatabase



