On Jan 13, 8:15 pm, Chris Meller <[email protected]> wrote: > So after reading the MySQL page you referenced in the ticket I decided to do > a few tests to make sure I understood the problem and thought they might be > useful to others to illustrate it. > > I created 3 tables, each with a different encoding: utf8_bin, > utf8_general_ci, and utf8_unicode_ci. Each table has an int ID and a varchar > name field. The varchar field doesn't have a specific character set or > collation specified, so it should inherit the table's. I then loaded up 4 > values into the name field of each table. Two were plain ASCII and two > included one of the disputed characters. > > Apparently the whole point of MySQL's transliteration is to fix sort results. > Only speaking English I really don't understand why something like 'ß' would > be sorted like 'ss', but as long as we understand and accept that sorting by > text fields may vary that's fine... We don't sort by any by default that I'm > aware of anyway. You can compare the results of the different sort orders at > [1]. >
'ss' is the normal tranliteration for 'ß'. > The problem for us comes when you're trying to pick out a single result based > on a text field... like the slug in ticket #1359. The results searching for > 'foo A', which should only return a single row from each table, are at [2]. > Clearly general_ci and unicode_ci are not working as we intend for them to. > > When sorting, MySQL's hidden transliteration (Ä to A) is fine and may yield > more expected results, but this totally breaks things when you're trying to > pick out a single result. A is neither the upper- nor lower-case version of Ä > in any language I'm aware of, so it has nothing to do with being > case-insensitive. Nothing we do has ever cared about case, so being > case-insensitive is fine. Treating multiple totally-different unicode > characters the same, however, is not. > SQLite's default BINARY sort is case-sensitive, so upper case characters are all ordered before lower case characters. Habari doesn't normally order by text fields, so it usually doesn't matter. If we switch to utf8_bin for MySQL, the same sort of ordering would occur, instead of all AaBbCc, etc. as now happens with MySQL, we would get ABCabc with it. > It appears clear that we need to move to utf8_bin as the collation for all > our tables in MySQL. An updated version of the > schema/mysql/upgrades/post/3249.sql script should be all we need to make the > change. > We would need to update the schema for MySQL also, for new installs. > SQLite does appear to only support binary collations and Postgres doesn't > support individual encodings on tables or fields (just on the parent > database, so we're at the mercy of the person creating it), so MySQL should > be the only one we need to worry about (yet again). > To be clear, SQLite does support other collations. BINARY is the default, and the one we use. There is also NOCASE, which imposes a case-insensitive ordering on ASCII characters, but not those outside the ASCII range, and RTRIM, which tells SQLite to ignore trailing spaces in comparisons. Rick > [1]:http://paste.roguecoders.com/p/adb2dfdd539d696a1efbe7175e15771d.txt > [2]:http://paste.roguecoders.com/p/d0c418fdaae0868eb4de609fe7e65878.txt > > On Jan 13, 2011, at 11:13 AM, rick c wrote: > > > > > > > > > Character set collations are a complex subject that has bitten Habari > > in the past and has returned again in the form of how the umlaut is > > dealt with, as noted in ticket #1359 > > (https://trac.habariproject.org/habari/ticket/1359 > > ). > > > This issues occurs because Habari uses the utf8_unicode_ci collation > > when creating MySQL tables. This collation is case insensitive, and > > doesn't differentiate between characters that have an umlaut, and the > > same character without the umlaut. I don't know if other accents are > > affected or not. > > > There isn't an issue with SQLite, which, by default, uses a binary > > collation. It appears to be case sensitive, and differentiates > > characters with umlauts from those without. > > > My question is, would we be better off using utf8_bin for the MySQL > > collation? As best I have been able to find out, it would give results > > similar to what we see in SQLite when comparing strings. > > > -- > > To post to this group, send email to [email protected] > > To unsubscribe from this group, send email to > > [email protected] > > For more options, visit this group > > athttp://groups.google.com/group/habari-dev -- To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/habari-dev
