Re: Searching on strings which might contain special characters.

Michiel Meeuwissen Wed, 03 Mar 2004 06:33:01 -0800

Henk Hangyi <[EMAIL PROTECTED]> wrote:
> When I search on a field in MMBase, I would like to take the special
> characters into account. For instance when I search on "Yes" I would
> also like to find "Yés", "Yès", "Yês" and "Yës".
> 
> My first approach was to search on "Y_s". This does not work, since é,
> è, ê and ë are represented in the database by a two character sequence.
> So in order to find "Yés", "Yès", "Yês" and "Yës", I had to search on
> "Y__s" (= Y,underscore,underscore,s ).


It depends heavily on which database you use. I do not think 'special'
characters are stored as two-character sequences as such. Assuming that you
use MySQL, they are stored as UTF-8 sequences, which are 1 or more bytes
long (for accented letters often, but not always, 2 bytes).
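
To make the byte counts concrete, here is a small sketch in Python (used only
for illustration, it has nothing to do with MMBase itself). It prints how
many bytes a few letters take in UTF-8. If the database compares bytes rather
than characters, this would explain why "Y__s" matches where "Y_s" does not.

```python
# Number of bytes a single character occupies when encoded as UTF-8.
def utf8_length(ch: str) -> int:
    return len(ch.encode("utf-8"))

# e, e-acute, c-cedilla, s-hacek, Polish l, euro sign
samples = ["e", "\u00e9", "\u00e7", "\u0161", "\u0142", "\u20ac"]
lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")
```

The accented Latin letters come out at 2 bytes each, while the euro sign
needs 3, so a fixed "two underscores per special character" rule cannot be
relied on in general.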


> When trying to find "Yes" and all of its mutations, one has to search
> both on "Yes" and "Y__s". For one word this solution already becomes
> cumbersome. For instance, when searching on "development" one has to
> use: developement, d__velopement, dev__lopement, etc, etc (= 32
> possibilities).

It is actually nearly impossible, because the consonants can carry accents
too, for example c-cedilla, s-hacek or the Polish l.

I'm not quite sure what _ means here, but for MySQL you could consider using
regular expressions.

So, if you want to completely ignore vowels (and only require their presence)

d.+v.+l.+p.+m.+nt 

or so.
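
As a rough check of what such a pattern accepts, here is a sketch using
Python's re module (an assumption for illustration only; in MySQL the pattern
would go into a REGEXP clause instead). The word list is made up for the
example.

```python
import re

# The pattern from above: consonants are fixed, and one or more arbitrary
# characters are required wherever a vowel is expected.
pattern = re.compile(r"d.+v.+l.+p.+m.+nt")

def matches(word: str) -> bool:
    # fullmatch: the whole word has to fit the pattern, not just a part of it.
    return pattern.fullmatch(word) is not None

candidates = ["developement", "d\u00e9veloppement", "development", "dvlpmnt"]
results = {word: matches(word) for word in candidates}

for word, ok in results.items():
    print(word, "matches" if ok else "does not match")
```

Note that the pattern follows the spelling "developement" used in the
question. The usual spelling "development" has nothing between the p and the
m, so it would need something like d.+v.+l.+pm.+nt instead. Also, .+ accepts
any characters at all, not only vowels, so the pattern is quite loose.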

Support for regular expression matching is not yet present in 1.7, but I
have implemented it already.


Anyway, it is really a problem of the database. MySQL actually has no
support for UTF-8, and I honestly don't know how good its support for
ISO-8859-1 is in this respect either.

PostgreSQL supports UTF-8, but I don't know how well it supports
searching/sorting either, because for that you _need_ to know not only the
encoding but also the locale, and as far as I know you cannot configure that.
It should, for example, somehow know whether e and \'e should be considered
the same letter or not.

Java, however, has excellent support for Unicode (Collators and so on), but I
would not know how to use it here :-)

> Has anybody tried to solve the same problem? Does anybody have a
> suggestion?

I have never tried to solve it in MMBase. I once solved it in a Perl/MySQL
application, which involved storing everything twice (once completely
stripped of accents, to search in, and once as the real string). The
Field-Types project might already provide the means to implement a similar
hack.
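
The "store everything twice" hack could look roughly like the sketch below.
It is not the original Perl code: it uses Python with an in-memory SQLite
database as a stand-in, and the table name, column names and helper functions
are all made up for the example.

```python
import sqlite3
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (real_value TEXT, search_value TEXT)")

def store(value: str) -> None:
    # Keep the real string, plus a lower-cased, accent-free copy to search in.
    conn.execute(
        "INSERT INTO words (real_value, search_value) VALUES (?, ?)",
        (value, strip_accents(value).lower()),
    )

def search(term: str) -> list:
    # Normalise the search term the same way, then compare plain strings.
    rows = conn.execute(
        "SELECT real_value FROM words WHERE search_value = ?",
        (strip_accents(term).lower(),),
    )
    return [row[0] for row in rows]

for word in ["Yes", "Y\u00e9s", "Y\u00e8s", "Y\u00eas", "Y\u00ebs", "No"]:
    store(word)

found = search("Yes")
print(found)
```

One caveat: this only removes accents that Unicode treats as separate
combining marks. The Polish l has no such decomposition, so it would need an
explicit mapping of its own.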

I think the actual solution might lie in the realm of full-text-search
engines (I suppose their indices ignore accents in a similar fashion),
which are available for MySQL and PostgreSQL, but for which MMBase as yet
lacks all support.

 Michiel


-- 
Michiel Meeuwissen 
Mediapark C101 Hilversum  
+31 (0)35 6772979
nl_NL eo_XX en_US
mihxil'