Re: umlaut normalisation

Markus Spath Tue, 27 Jan 2004 12:07:28 -0800

Thomas Scheffler wrote:

So there is one query part with the WhiteSpaceAnalyzer and the other with
GermanAnalyzer. But I don't know why Hühnerstall gets turned into huhnerstall.

It's because the GermanAnalyzer applies the GermanStemmer (via the GermanStemFilter), which substitutes umlauts with their non-umlaut counterparts, so ü becomes u.

From org.apache.lucene.analysis.de.GermanStemmer:

    /**
     * Do some substitutions for the term to reduce overstemming:
     *
     * - Substitute Umlauts with their corresponding vowel: äöü -> aou,
     *   "ß" is substituted by "ss"
     * - Substitute a second char of an pair of equal characters with
     *   an asterisk: ?? -> ?*
     * - Substitute some common character combinations with a token:
     *   sch/ch/ei/ie/ig/st -> $/§/%/&/#/!
     *
     * @return  The term with all needed substitutions.
     */
    private StringBuffer substitute( StringBuffer buffer ) {
    ...

If you want your Analyzer to produce tokens like 'huehnerstall', probably the easiest option is to start with the GermanAnalyzer and add an UmlautFilter that runs before the GermanStemFilter is applied.
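
A rough sketch of the character mapping such a filter would need, in plain Java with no Lucene dependency. The class and method names (UmlautTransliterator, transliterate) are made up for illustration and are not part of Lucene. Wiring this into a TokenFilter is left out because that API depends on the Lucene version in use.

```java
public class UmlautTransliterator {

    /**
     * Expands German umlauts and sharp s into two-letter ASCII spellings
     * (a-umlaut -> ae, o-umlaut -> oe, u-umlaut -> ue, sharp s -> ss).
     * Hypothetical helper, not a Lucene class.
     */
    public static String transliterate(String term) {
        StringBuilder out = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            // Unicode escapes keep the source independent of file encoding.
            switch (c) {
                case '\u00e4': out.append("ae"); break; // lowercase a-umlaut
                case '\u00f6': out.append("oe"); break; // lowercase o-umlaut
                case '\u00fc': out.append("ue"); break; // lowercase u-umlaut
                case '\u00c4': out.append("Ae"); break; // uppercase A-umlaut
                case '\u00d6': out.append("Oe"); break; // uppercase O-umlaut
                case '\u00dc': out.append("Ue"); break; // uppercase U-umlaut
                case '\u00df': out.append("ss"); break; // sharp s
                default:       out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input is the lowercased term "h\u00fchnerstall".
        System.out.println(transliterate("h\u00fchnerstall")); // prints "huehnerstall"
    }
}
```

The order matters: this mapping has to see the term before the stemmer does, because once the stemmer has turned ü into u there is nothing left to expand.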

markus

