Yves!
This is a complicated matter alright, but it is a complicated
problem to solve here also. Your statement about characters being the
same isn't really correct. To take an example: Let's assume you were
doing a phonebook, in print, of all people in the world. How would you
sort that? All names in the book should be printed correctly as the name
is usually printed in the respective country of origin. You would
realize that such a phonebook just couldn't be made in just one print.
Certain characters, although they can appear (at least as part of a
name) are treated differently in different countries.
Two examples: The nordic "umlaut / ring" characters: å ä and ö.
(aring;, auml; and ouml; in HTML lingo). These are sorted differently in
the different countries where they are used. In sweden, the are last in
the alphabet, in germay, they are usually, IIRC, intermixed with the a
and o respectively.
Another, and much better, example is the accented characters. In some
languages, accnts are very important parts of the language, French
probably being the best example here. leaving an accent out would change
things considerably, and with or without an eccent would change the sort
order. In Sweden, accents exist, even in Swedish names, and they change
the pronounciation of the word slightly (although you usually know what
the intention is, even when it is left out). But the accented characters
are treated, collation wise in any type of listing, phonebooks etc, as
the accent just wasn't there. The names Linden and Lindén are pronounced
differently, but sorted together as the accent wasn't there at all.
To you specific problem then, the issue is that as we can have just
about every character in the world available in UNICODE (this isn't
true, really, but for this discussion, let's assume this is the case).
The important thing when you store data is that you allow all these
characters to be stored, i.e. the utf8 charcater set is supported. The
"collate" specification to the is just the default ordering for the
column. Like the phonebook example above, this is how we would sort the
characters in the phonebook, lets assume we use swedish. Then the nice
thing with MySQL is that you can allow another sort order and/or
comparison method, like being able to resort the phonebook for
non-swedish people.
As for comparisons, the issue is the same. You don't know, assuming
the phonebook problem above, if someone looking for a person in the book
is French, when accented characters should be properly compared, or
swedish, when they are to be ignored. The solution is to say what
language you want, or if you want a binary comparison. If you want to
accknowledge exact matching, and say any character, accented / unlauted
etc, is different from any other character, specifiy a binary comparison:
SELECT * FROM phonebook WHERE BINARY name = 'Handel';
Look into the character set casting / conversion functions in the
MySQL manual: http://dev.mysql.com/doc/refman/5.0/en/cast-functions.html
Alternatively, you could specify the client collation, which would
apply to all operations. Or you could create your own collation. I would
really like more case sensitive collations myself. Case sensitiveness is
also something that is different for different characters in different
languages.
Hope this helps a bit
/Karlsson
Yves Goergen wrote:
Hello,
I've just read through the MySQL documentation about Unicode support,
collations and how it affects sorting and comparison of strings. And I
find it horrible, at least. I feel like I'm back in the MySQL 3.x days
where I used UTF-8 in my application and MySQL treated it binary. The
only problem was incorrect sorting of things. Today we have UTF-8
support in MySQL, which brings correct sorting (for whatever
definition of "correct") but has taken correct comparison again.
When I have three strings, e.g. "Handel", "Händel" and "Hendel", I'd
like to have them sorted correctly. Using the
utf8_{general,unicode}_ci collation seems the only way. Now when I
want the row with "Handel" in it, I'll get two rows back. One of them
is not what I wanted. So strictly, the result is incorrect. The only
way to get this right is using the utf8_bin collation. But this again
makes correct sorting impossible.
It's a nightmare. Why can't I get correct sorting *and* correct (i.e.
precise) comparison in one?
If I cannot even rely on the = operator, what good is a text-storing
database? There even isn't a case-sensitive unicode collation other
than utf8_bin. This means that in every database application that uses
unicode, I cannot separate lower from uppercase when retrieving stuff.
MySQL is simply blind for that. Not to mention different characters
that Unicode, MySQL, DIN, ISO or whoever think are the same, but they
aren't. If they were the same, you wouldn't need both of them.
Finally, my application should really be portable. I haven't looked
into how other DBMS handle it and whether the SQL syntax would be the
same, should there be any method on the language layer to do it right.
I only know that SQLite stores in UTF-8 but otherwise doesn't care
about Unicode, i.e. sorting should be broken, comparison is correct.
PostgreSQL didn't find its own columns again, so I cancelled the test.
--
__ ___ ___ ____ __
/ |/ /_ __/ __/ __ \/ / Anders Karlsson ([EMAIL PROTECTED])
/ /|_/ / // /\ \/ /_/ / /__ MySQL AB, Sales Engineer
/_/ /_/\_, /___/\___\_\___/ Stockholm
<___/ www.mysql.com Cellphone: +46 708 608121
Skype: drdatabase
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]