Henk Hangyi <[EMAIL PROTECTED]> wrote:
> When I search on a field in MMBase, I would like to take the special
> characters into account. For instance when I search on "Yes" I would
> also like to find "Yés", "Yès", "Yês" and "Yës".
>
> My first approach was to search on "Y_s". This does not work, since é,
> è, ê and ë are represented in the database by a two character sequence.
> So in order to find "Yés", "Yès", "Yês" and "Yës", I had to search on
> "Y__s" (= Y, underscore, underscore, s).
It depends heavily on which database you use. Actually, I do not think
'special' characters are represented by two-character sequences, but by
UTF-8 sequences (assuming that you use MySQL), which are one or more
bytes long: often, but not always, two when not one.

> When trying to find "Yes" and all of its mutations, one has to search
> both on "Yes" and "Y__s". For one word this solution becomes already
> cumbersome. For instance consider searching on "development" one has to
> use: developement, d__velopement, dev__lopement, etc, etc (= 32
> possibilities).

It is actually nearly impossible, because the consonants can carry
accents too, for example c-cedilla, s-hacek or the Polish l with stroke.

I'm not quite sure what _ means here (presumably the single-character
wildcard of SQL LIKE), but for MySQL you could consider using regular
expressions. So, if you want to ignore the vowels completely (and only
require that something is present in their place): d.+v.+l.+p.+m.+nt or
so. Support for regular expression matching is not yet present in 1.7,
but I have implemented it already.

Anyway, it is really a problem of the database. MySQL actually has no
support for UTF-8, and I honestly don't know how good its support for
ISO-8859-1 is in this respect either. PostgreSQL supports UTF-8, but I
don't know how well it supports searching/sorting, because for that you
_need_ to know not only the encoding but also the locale, and as far as I
know you cannot configure that. It should e.g. somehow know whether e and
\'e are to be considered the same letter or not. Java, however, has
excellent support for Unicode (Collators and so on), but I would not know
how to use it here :-)

> Did anybody tried to solve the same problem? Does anybody has a
> suggestion?

I have never tried to solve it in MMBase. I once solved it in a
Perl/MySQL application, which involved storing everything twice (once
completely stripped of accents, to search in, and once as the real
string). The Field-Types project might already provide the means to
implement a similar hack.
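To make the "store everything twice" idea concrete, here is a minimal sketch in Python. It is not the original Perl code and not anything MMBase provides: the names strip_accents, search_key, rows and find are made up for this illustration, and the list of tuples merely stands in for a table with a search column next to the real column. It assumes that decomposing to Unicode NFD and dropping the combining marks is an acceptable definition of "stripped of accents".

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD and drop combining marks, so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_key(text: str) -> str:
    """Hypothetical search key: accents stripped, case folded."""
    return strip_accents(text).casefold()


# Stand-in for a table that stores every value twice:
# (key to search in, real string to show).
words = ["Yes", "Yés", "Yès", "Yês", "Yës", "No"]
rows = [(search_key(word), word) for word in words]


def find(query: str) -> list[str]:
    """Return the real strings whose stored key equals the query's key."""
    key = search_key(query)
    return [original for stored_key, original in rows if stored_key == key]


if __name__ == "__main__":
    print(find("yes"))
```

One caveat with this approach: letters such as the Polish l with stroke have no Unicode decomposition, so NFD leaves them untouched and they would need an explicit mapping of their own.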
I think the actual solution might lie in the realm of full-text search
engines (I suppose their indices ignore accents in a similar fashion),
which are available for MySQL and PostgreSQL, but for which all support
is as yet lacking in MMBase.

Michiel

--
Michiel Meeuwissen
Mediapark C101 Hilversum
+31 (0)35 6772979
nl_NL eo_XX en_US
mihxil'
