Re: Umlauts et al.

Tibor Simko Tue, 16 Nov 2010 19:56:35 +0100

On Tue, 16 Nov 2010, Brooks, Travis C. wrote:
> u and ü are treated the same


This is due to strip_accents() that handles one-to-one mapping only.
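
To illustrate what a one-to-one mapping means here, a minimal sketch
follows.  It is not the actual strip_accents() code; the function name
strip_accents_sketch and the use of Unicode decomposition are
assumptions made for illustration only.

```python
import unicodedata


def strip_accents_sketch(text):
    """Hypothetical stand-in for strip_accents(): one character in, one out."""
    # Decompose each accented character into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents_sketch("müller"))
```

With such a mapping `ü' can only ever become `u', never `ue', which is
why `u' and `ü' end up being treated the same.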

> Would it not make sense to have ue <-> u <-> ü as well.  I know we've
> talked about this before, but can someone remind me what our options
> are here?

This requires putting a kind of synonym expansion around the
get_words_from_foo() family of functions in the indexer, so that one
index term could generate several.  This is both useful to have and
straightforward to implement.  However, we should muse some more about
how far we would like to go here.  E.g. the direction `ue' -> `u', `ü'
should not be automatic, since it would not play nicely for words like
`cruel'.  If we want to keep it simple, then I think we should support
only one direction, `ü' -> `u', `ue', which would seem reasonable to do
regardless of the concrete metadata field and/or concrete language.  We
could assemble a few other language-independent expansions of this kind
and plug them into the indexer as mentioned above.  (Every index could
be configured to use a different set of expansions, or none, as with
stemming.)
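
A rough sketch of how such a one-directional expansion could wrap a
word-splitting function is given below.  All names in it (EXPANSIONS,
expand_term, get_words_sketch, get_expanded_words) are hypothetical and
not part of the indexer; the `ä' and `ö' entries are assumed by analogy
with `ü'; the tokenizer is a simple stand-in for the real
get_words_from_foo() functions.

```python
import re
import unicodedata
from itertools import product

# Hypothetical expansion table, one direction only: accented -> plain forms.
# Only the `ü' entry comes from the discussion; the others are assumed.
EXPANSIONS = {
    "ü": ("u", "ue"),
    "ä": ("a", "ae"),
    "ö": ("o", "oe"),
}


def expand_term(term, expansions=EXPANSIONS):
    """Return the set of index terms generated from a single term."""
    # Each character contributes either its expansions or itself.
    choices = [expansions.get(ch, (ch,)) for ch in term]
    return {"".join(combo) for combo in product(*choices)}


def get_words_sketch(text):
    """Stand-in tokenizer: lowercase the text and split on non-word chars."""
    # NFC makes sure `ü' is one code point, so the table lookup can match.
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", text)


def get_expanded_words(text, expansions=EXPANSIONS):
    """Wrap the tokenizer so that one word may yield several index terms."""
    terms = set()
    for word in get_words_sketch(text):
        terms |= expand_term(word, expansions)
    return terms


print(sorted(get_expanded_words("Müller is cruel")))
```

Since there is no rule in the `ue' -> `ü' direction, `cruel' is indexed
unchanged.  Passing an empty table (expansions={}) switches the
expansion off, which is how a per-index configuration could look.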

Alternatively, we can try to be more fancy and attempt some
language-specific analysis and treatment, so depending on the language
of the document and/or of the field used, we would do various stuff to
the text.

I think the former should probably be sufficient.  WDYT?

Best regards
-- 
Tibor Simko
