Re: REGEXP and unicode weirdness

fsb Thu, 21 Jan 2010 13:13:26 -0800

On 1/21/10 10:27 AM, "John Campbell" <jcampbe...@gmail.com> wrote:


> I want to find rows that contain a word that matches a term, accent
> insensitive:  I am using utf8-general collation everywhere.
> 
> attempt 1:
> SELECT * FROM t WHERE txt LIKE '%que%'
> Matches que and qué, but also matches 'queue'.
> 
> attempt 1.5:
> SELECT * FROM t WHERE txt LIKE '% que %' OR txt LIKE 'que %' OR txt LIKE '% que';
> Almost, but misses "que!"  or 'que...'
> 
> attempt2:
> SELECT * FROM t WHERE txt REGEXP '[[:<:]]que[[:>:]]'
> Matches que, not queue, but doesn't match qué.
> 
> attempt3
> SELECT * FROM t WHERE txt REGEXP
> '[[:<:]]q[uùúûüũūŭůűųǔǖǘǚǜ][eèéêëēĕėęě][[:>:]]'
> Matches que, queue, qué.  (I have no idea why this matches queue, but
> the Regex behavior is bizarre with unicode.)
> 
> Does anyone know why the final regex acts weird?

"Warning

"The REGEXP and RLIKE operators work in byte-wise fashion, so they are not
multi-byte safe and may produce unexpected results with multi-byte character
sets. In addition, these operators compare characters by their byte values
and accented characters may not compare as equal even if a given collation
treats them as equal." -- MySQL Reference Manual, section 11.4.2
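
To make the warning concrete, here is a small Python sketch (not MySQL
itself) that imitates byte-wise matching by running patterns over the UTF-8
bytes. It assumes the column is stored as UTF-8; the variable names are made
up for illustration. It shows why attempt 2 misses "qué", and that a bracket
expression full of accented letters turns into a set of single bytes. It does
not reproduce the exact "queue" match from attempt 3, which depends on the
regex library MySQL uses.

```python
import re

text = "qué"
raw = text.encode("utf-8")  # b'qu\xc3\xa9': the accented e takes two bytes

# Byte-wise, the literal pattern "que" cannot match "qué":
# the byte after "qu" is 0xC3, not 0x65 ("e").
bytewise_literal = re.search(rb"que", raw) is not None

# A bracket expression holding multi-byte characters degrades into a set of
# single bytes when the engine works byte by byte.
accented_e = "eèéêë"
byte_set = sorted(set(accented_e.encode("utf-8")))
byte_class = b"[" + re.escape(bytes(byte_set)) + b"]"

# The class now matches a lone lead byte (half of a character) ...
matches_half_char = re.fullmatch(b"qu" + byte_class, b"qu\xc3") is not None
# ... but one class position can never consume both bytes of "é".
matches_whole_word = re.fullmatch(b"qu" + byte_class, raw) is not None

# For contrast, a character-aware engine (Python's re on str) behaves as hoped.
char_pattern = re.compile(r"\bqu[eèéêë]\b")
char_matches_que_accent = char_pattern.search("qué") is not None
char_matches_queue = char_pattern.search("queue") is not None
```

So with a byte-wise engine, each bracket position consumes exactly one byte,
and the accented alternatives in it are effectively lists of lead and
continuation bytes rather than letters.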


> Is there a good solution?

Doesn't look like it.

Sphinxsearch might work nicely for you (it does for me), but that may not be
an option for you. I generated a Sphinxsearch charset_table config that
mimics utf8_general_ci collation.
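
For anyone wanting to build a similar table, the following Python sketch
shows one possible way to generate accent-folding entries in the
"U+XXXX->U+YYYY" mapping form that charset_table accepts. It is not the
script I used: the helper names fold and charset_entries are hypothetical,
and stripping combining marks via Unicode NFD only approximates
utf8_general_ci (it skips letters such as ß and Æ, which the collation
handles differently).

```python
import unicodedata

def fold(ch):
    """Strip combining marks and lowercase: a rough stand-in for utf8_general_ci."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()

def charset_entries(start, end):
    """Build 'U+XX->U+YY' entries for letters in [start, end] that fold to ASCII."""
    entries = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        if not ch.isalpha():
            continue  # skip symbols such as the multiplication sign
        target = fold(ch)
        # Keep only clean one-to-one folds onto a plain ASCII letter.
        if len(target) == 1 and target != ch and target.isascii():
            entries.append(f"U+{cp:X}->U+{ord(target):X}")
    return entries

# Latin-1 Supplement letters, e.g. U+E9 (é) folds to U+65 (e).
latin1 = charset_entries(0xC0, 0xFF)
config_fragment = ", ".join(latin1)
```

The resulting config_fragment would still need the plain ASCII ranges
(digits, a..z, A..Z->a..z) added in front of it, and any letters the sketch
skips mapped by hand.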



--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql?unsub=arch...@jab.org
