Yves!

This is a complicated matter alright, but it is a complicated problem to solve here also. Your statement about characters being the same isn't really correct. To take an example: Let's assume you were doing a phonebook, in print, of all people in the world. How would you sort that? All names in the book should be printed correctly as the name is usually printed in the respective country of origin. You would realize that such a phonebook just couldn't be made in just one print. Certain characters, although they can appear (at least as part of a name) are treated differently in different countries. Two examples: The nordic "umlaut / ring" characters: å ä and ö. (aring;, auml; and ouml; in HTML lingo). These are sorted differently in the different countries where they are used. In sweden, the are last in the alphabet, in germay, they are usually, IIRC, intermixed with the a and o respectively. Another, and much better, example is the accented characters. In some languages, accnts are very important parts of the language, French probably being the best example here. leaving an accent out would change things considerably, and with or without an eccent would change the sort order. In Sweden, accents exist, even in Swedish names, and they change the pronounciation of the word slightly (although you usually know what the intention is, even when it is left out). But the accented characters are treated, collation wise in any type of listing, phonebooks etc, as the accent just wasn't there. The names Linden and Lindén are pronounced differently, but sorted together as the accent wasn't there at all.

To you specific problem then, the issue is that as we can have just about every character in the world available in UNICODE (this isn't true, really, but for this discussion, let's assume this is the case). The important thing when you store data is that you allow all these characters to be stored, i.e. the utf8 charcater set is supported. The "collate" specification to the is just the default ordering for the column. Like the phonebook example above, this is how we would sort the characters in the phonebook, lets assume we use swedish. Then the nice thing with MySQL is that you can allow another sort order and/or comparison method, like being able to resort the phonebook for non-swedish people.

As for comparisons, the issue is the same. You don't know, assuming the phonebook problem above, if someone looking for a person in the book is French, when accented characters should be properly compared, or swedish, when they are to be ignored. The solution is to say what language you want, or if you want a binary comparison. If you want to accknowledge exact matching, and say any character, accented / unlauted etc, is different from any other character, specifiy a binary comparison:
SELECT * FROM phonebook WHERE BINARY name = 'Handel';

Look into the character set casting / conversion functions in the MySQL manual: http://dev.mysql.com/doc/refman/5.0/en/cast-functions.html

Alternatively, you could specify the client collation, which would apply to all operations. Or you could create your own collation. I would really like more case sensitive collations myself. Case sensitiveness is also something that is different for different characters in different languages.

Hope this helps a bit
/Karlsson

Yves Goergen wrote:
Hello,

I've just read through the MySQL documentation about Unicode support, collations and how it affects sorting and comparison of strings. And I find it horrible, at least. I feel like I'm back in the MySQL 3.x days where I used UTF-8 in my application and MySQL treated it binary. The only problem was incorrect sorting of things. Today we have UTF-8 support in MySQL, which brings correct sorting (for whatever definition of "correct") but has taken correct comparison again.

When I have three strings, e.g. "Handel", "Händel" and "Hendel", I'd like to have them sorted correctly. Using the utf8_{general,unicode}_ci collation seems the only way. Now when I want the row with "Handel" in it, I'll get two rows back. One of them is not what I wanted. So strictly, the result is incorrect. The only way to get this right is using the utf8_bin collation. But this again makes correct sorting impossible.

It's a nightmare. Why can't I get correct sorting *and* correct (i.e. precise) comparison in one?

If I cannot even rely on the = operator, what good is a text-storing database? There even isn't a case-sensitive unicode collation other than utf8_bin. This means that in every database application that uses unicode, I cannot separate lower from uppercase when retrieving stuff. MySQL is simply blind for that. Not to mention different characters that Unicode, MySQL, DIN, ISO or whoever think are the same, but they aren't. If they were the same, you wouldn't need both of them.

Finally, my application should really be portable. I haven't looked into how other DBMS handle it and whether the SQL syntax would be the same, should there be any method on the language layer to do it right. I only know that SQLite stores in UTF-8 but otherwise doesn't care about Unicode, i.e. sorting should be broken, comparison is correct. PostgreSQL didn't find its own columns again, so I cancelled the test.



--
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /  Anders Karlsson ([EMAIL PROTECTED])
 / /|_/ / // /\ \/ /_/ / /__ MySQL AB, Sales Engineer
/_/  /_/\_, /___/\___\_\___/ Stockholm
       <___/   www.mysql.com Cellphone: +46 708 608121
                              Skype: drdatabase



--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to