[habari-dev] Re: Should Habari use utf8_bin or utf8_unicode_ci as Its MySQL collation

rick c Thu, 13 Jan 2011 18:01:19 -0800


On Jan 13, 8:15 pm, Chris Meller <[email protected]> wrote:
> So after reading the MySQL page you referenced in the ticket I decided to do 
> a few tests to make sure I understood the problem and thought they might be 
> useful to others to illustrate it.
>
> I created 3 tables, each with a different encoding: utf8_bin, 
> utf8_general_ci, and utf8_unicode_ci. Each table has an int ID and a varchar 
> name field. The varchar field doesn't have a specific character set or 
> collation specified, so it should inherit the table's. I then loaded up 4 
> values into the name field of each table. Two were plain ASCII and two 
> included one of the disputed characters.
>
> Apparently the whole point of MySQL's transliteration is to fix sort results. 
> Only speaking English I really don't understand why something like 'ß' would 
> be sorted like 'ss', but as long as we understand and accept that sorting by 
> text fields may vary that's fine... We don't sort by any by default that I'm 
> aware of anyway. You can compare the results of the different sort orders at 
> [1].
>


'ss' is the normal tranliteration for 'ß'.

> The problem for us comes when you're trying to pick out a single result based 
> on a text field... like the slug in ticket #1359. The results searching for 
> 'foo A', which should only return a single row from each table, are at [2]. 
> Clearly general_ci and unicode_ci are not working as we intend for them to.
>
> When sorting, MySQL's hidden transliteration (Ä to A) is fine and may yield 
> more expected results, but this totally breaks things when you're trying to 
> pick out a single result. A is neither the upper- nor lower-case version of Ä 
> in any language I'm aware of, so it has nothing to do with being 
> case-insensitive. Nothing we do has ever cared about case, so being 
> case-insensitive is fine. Treating multiple totally-different unicode 
> characters the same, however, is not.
>

SQLite's default BINARY sort is case-sensitive, so upper case
characters are all ordered before lower case characters. Habari
doesn't normally order by text fields, so it usually doesn't matter.
If we switch to utf8_bin for MySQL, the same sort of ordering would
occur, instead of all AaBbCc, etc. as now happens with MySQL, we would
get ABCabc with it.

> It appears clear that we need to move to utf8_bin as the collation for all 
> our tables in MySQL. An updated version of the 
> schema/mysql/upgrades/post/3249.sql script should be all we need to make the 
> change.
>

We would need to update the schema for MySQL also, for new installs.

> SQLite does appear to only support binary collations and Postgres doesn't 
> support individual encodings on tables or fields (just on the parent 
> database, so we're at the mercy of the person creating it), so MySQL should 
> be the only one we need to worry about (yet again).
>

To be clear, SQLite does support other collations. BINARY is the
default, and the one we use. There is also NOCASE, which imposes a
case-insensitive ordering on ASCII characters, but not those outside
the ASCII range, and RTRIM, which tells SQLite to ignore trailing
spaces in comparisons.

Rick

> [1]:http://paste.roguecoders.com/p/adb2dfdd539d696a1efbe7175e15771d.txt
> [2]:http://paste.roguecoders.com/p/d0c418fdaae0868eb4de609fe7e65878.txt
>
> On Jan 13, 2011, at 11:13 AM, rick c wrote:
>
>
>
>
>
>
>
> > Character set collations are a complex subject that has bitten Habari
> > in the past and has returned again in the form of how the umlaut is
> > dealt with, as noted in ticket #1359 
> > (https://trac.habariproject.org/habari/ticket/1359
> > ).
>
> > This issues occurs because Habari uses the utf8_unicode_ci collation
> > when creating MySQL tables. This collation is case insensitive, and
> > doesn't differentiate between characters that have an umlaut, and the
> > same character without the umlaut. I don't know if other accents are
> > affected or not.
>
> > There isn't an issue with SQLite, which, by default, uses a binary
> > collation. It appears to be case sensitive, and differentiates
> > characters with umlauts from those without.
>
> > My question is, would we be better off using utf8_bin for the MySQL
> > collation? As best I have been able to find out, it would give results
> > similar to what we see in SQLite when comparing strings.
>
> > --
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to 
> > [email protected]
> > For more options, visit this group 
> > athttp://groups.google.com/group/habari-dev

-- 
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at http://groups.google.com/group/habari-dev

[habari-dev] Re: Should Habari use utf8_bin or utf8_unicode_ci as Its MySQL collation

Reply via email to