On Tue, 16 Nov 2010, Brooks, Travis C. wrote:
> u and ü are treated the same

This is due to strip_accents(), which handles one-to-one mappings only.

> Would it not make sense to have ue <-> u <-> ü as well. I know we've
> talked about this before, but can someone remind me what our options
> are here?

This would require putting a kind of synonym expansion around the
get_words_from_foo() family of functions in the indexer, so that one
index term could generate several. This is both useful to have and
straightforward to implement.

However, we should muse some more about how far we would like to go
here. E.g. the direction `ue' -> `u', `ü' should not be automatic,
since it would not play nicely with words like `cruel'.

If we want to keep it simple, then I think we should support only one
direction, `ü' -> `u', `ue', which would seem reasonable to do
regardless of the concrete metadata field and/or concrete language. We
could assemble a few other language-independent expansions of this kind
and plug them into the indexer as mentioned above. (Every index could
be configured to use a different set of expansions, or none, as with
stemming.)

Alternatively, we can try to be more fancy and attempt some
language-specific analysis and treatment, so that, depending on the
language of the document and/or of the field used, we would treat the
text differently.

I think the former should probably be sufficient. WDYT?

Best regards
-- 
Tibor Simko
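
For illustration, here is a minimal sketch of the one-directional
expansion discussed above (`ü' -> `u', `ue'), written in Python. It is
not the indexer's actual code: expand_term(), expand_words() and the
EXPANSIONS table are hypothetical names, the `ö' and `ä' entries are
assumed examples of "other expansions of this kind", and the input is
assumed to be already lowercased. In practice the wrapper would sit
around the output of the get_words_from_foo() functions.

```python
from itertools import product

# Hypothetical, language-independent expansion table. Only the "ü" entry
# comes from the discussion; "ö" and "ä" are assumed analogues.
EXPANSIONS = {
    "ü": ("u", "ue"),
    "ö": ("o", "oe"),
    "ä": ("a", "ae"),
}


def expand_term(term, expansions=EXPANSIONS):
    """Return the set of index terms generated from one input term."""
    # For every character, list its replacements (itself if unmapped),
    # then build all combinations so that several umlauts are handled.
    choices = [expansions.get(ch, (ch,)) for ch in term]
    return {"".join(combo) for combo in product(*choices)}


def expand_words(words, expansions=EXPANSIONS):
    """Wrap a word extractor's output so one word may yield several terms."""
    terms = set()
    for word in words:
        terms |= expand_term(word, expansions)
    return terms
```

With this table, expand_term("müller") gives `muller' and `mueller',
while `cruel' is left untouched, because the mapping only runs from the
accented character outward and never from `ue' back to `ü'. Passing a
different table (or an empty one) per index would correspond to the
per-index configuration mentioned above.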
