So after reading the MySQL page you referenced in the ticket I decided to do a 
few tests to make sure I understood the problem and thought they might be 
useful to others to illustrate it.

I created 3 tables, each with a different encoding: utf8_bin, utf8_general_ci, 
and utf8_unicode_ci. Each table has an int ID and a varchar name field. The 
varchar field doesn't have a specific character set or collation specified, so 
it should inherit the table's. I then loaded up 4 values into the name field of 
each table. Two were plain ASCII and two included one of the disputed 
characters.

Apparently the whole point of MySQL's transliteration is to fix sort results. 
Only speaking English I really don't understand why something like 'ß' would be 
sorted like 'ss', but as long as we understand and accept that sorting by text 
fields may vary that's fine... We don't sort by any by default that I'm aware 
of anyway. You can compare the results of the different sort orders at [1].

The problem for us comes when you're trying to pick out a single result based 
on a text field... like the slug in ticket #1359. The results searching for 
'foo A', which should only return a single row from each table, are at [2]. 
Clearly general_ci and unicode_ci are not working as we intend for them to.

When sorting, MySQL's hidden transliteration (Ä to A) is fine and may yield 
more expected results, but this totally breaks things when you're trying to 
pick out a single result. A is neither the upper- nor lower-case version of Ä 
in any language I'm aware of, so it has nothing to do with being 
case-insensitive. Nothing we do has ever cared about case, so being 
case-insensitive is fine. Treating multiple totally-different unicode 
characters the same, however, is not.

It appears clear that we need to move to utf8_bin as the collation for all our 
tables in MySQL. An updated version of the schema/mysql/upgrades/post/3249.sql 
script should be all we need to make the change.

SQLite does appear to only support binary collations and Postgres doesn't 
support individual encodings on tables or fields (just on the parent database, 
so we're at the mercy of the person creating it), so MySQL should be the only 
one we need to worry about (yet again).

[1]: http://paste.roguecoders.com/p/adb2dfdd539d696a1efbe7175e15771d.txt
[2]: http://paste.roguecoders.com/p/d0c418fdaae0868eb4de609fe7e65878.txt

On Jan 13, 2011, at 11:13 AM, rick c wrote:

> Character set collations are a complex subject that has bitten Habari
> in the past and has returned again in the form of how the umlaut is
> dealt with, as noted in ticket #1359 ( 
> https://trac.habariproject.org/habari/ticket/1359
> ).
> 
> This issues occurs because Habari uses the utf8_unicode_ci collation
> when creating MySQL tables. This collation is case insensitive, and
> doesn't differentiate between characters that have an umlaut, and the
> same character without the umlaut. I don't know if other accents are
> affected or not.
> 
> There isn't an issue with SQLite, which, by default, uses a binary
> collation. It appears to be case sensitive, and differentiates
> characters with umlauts from those without.
> 
> My question is, would we be better off using utf8_bin for the MySQL
> collation? As best I have been able to find out, it would give results
> similar to what we see in SQLite when comparing strings.
> 
> -- 
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to 
> [email protected]
> For more options, visit this group at 
> http://groups.google.com/group/habari-dev

-- 
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at http://groups.google.com/group/habari-dev

Reply via email to