: It looks like a very promising approach for us. I'm going to implement 
: a custom Tokeniser based on your suggestions and see how it goes. Thank 
: you all for your comments!

you don't really need a custom tokenizer -- just a buffered TokenFilter 
that clones the original token if it contains accent chars, mutates the 
clone, and then emits it next with a positionIncrement of 0.
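
a rough sketch of that idea in plain java, so it runs without lucene on the 
classpath. Tok, stripAccents and stackUnaccented are made-up names for 
illustration, not lucene APIs, and the folding uses java.text.Normalizer, 
which is an assumption: it will not map exactly the same characters as 
ISOLatin1AccentFilter (it leaves ligatures like "ae" alone, for example). a 
real version would do the same thing inside a TokenFilter.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class AccentStackingSketch {

    /** Hypothetical stand-in for a token: term text plus position increment. */
    public static final class Tok {
        public final String term;
        public final int positionIncrement;

        public Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Decompose, drop combining marks, then recompose whatever is left. */
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    /**
     * Passes every token through unchanged. When a token contains accents,
     * an unaccented copy is emitted right after it with a position
     * increment of 0, so both forms sit at the same position.
     */
    public static List<Tok> stackUnaccented(List<Tok> input) {
        List<Tok> output = new ArrayList<>();
        for (Tok original : input) {
            output.add(original);
            String folded = stripAccents(original.term);
            if (!folded.equals(original.term)) {
                output.add(new Tok(folded, 0));
            }
        }
        return output;
    }
}
```

because the copy has an increment of 0, phrase queries and slop still line 
up whether the user typed the accented or the unaccented form.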

i'm kind of surprised ISOLatin1AccentFilter doesn't have an option to do 
this already -- it would certainly be a worthy patch to commit if someone 
wants to submit it back to lucene-java.

: > don't match the accents exactly they won't get any hits: e.g. if a word
: > contains two accented characters and the user only accents one of them in
: > their query, they won't match the accented or the unaccented version.

this could be accounted for by generating all of the permutations of 
unaccented characters when indexing -- on its own that wouldn't solve the 
problem of a source term containing only one accent and the user querying 
with only one accent but on a different character ... you could work around 
this by putting all of the permutations in at index time, but querying on 
the exact term and the no-accent term at query time.
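
to make that concrete, here is a small plain-java sketch (again no lucene 
dependency). indexVariants and queryTerms are hypothetical helper names, and 
the accent stripping via java.text.Normalizer is an assumption rather than 
what ISOLatin1AccentFilter does internally. note the number of variants 
doubles with every accented character in the term.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AccentPermutationsSketch {

    /** Decompose, drop combining marks, then recompose whatever is left. */
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    /** Index side: every variant where each accented char is kept or folded. */
    public static Set<String> indexVariants(String term) {
        String nfc = Normalizer.normalize(term, Normalizer.Form.NFC);
        List<String> partial = new ArrayList<>();
        partial.add("");
        for (int i = 0; i < nfc.length(); ) {
            int cp = nfc.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            String folded = stripAccents(ch);
            List<String> next = new ArrayList<>();
            for (String prefix : partial) {
                next.add(prefix + ch);              // keep the char as written
                if (!folded.equals(ch)) {
                    next.add(prefix + folded);      // and also its unaccented form
                }
            }
            partial = next;
            i += Character.charCount(cp);
        }
        return new LinkedHashSet<>(partial);
    }

    /** Query side: only the exact term and the fully unaccented term. */
    public static Set<String> queryTerms(String term) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(Normalizer.normalize(term, Normalizer.Form.NFC));
        terms.add(stripAccents(term));
        return terms;
    }
}
```

with this split, a source term that has one accent and a query that puts one 
accent on a different character still meet on the fully unaccented form, 
while a query that accents only some of the characters of a doubly accented 
source term meets one of the partial variants.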


-Hoss
