I'm wondering if there is a bug in ArabicAnalyzer in Lucene 2.9. (I don't know Arabic or Farsi, but have some texts to index in those languages.)

The tokenizer/filter chain for ArabicAnalyzer is:
        TokenStream result = new ArabicLetterTokenizer( reader );
        result = new StopFilter( result, stoptable );
        result = new LowerCaseFilter(result);
        result = new ArabicNormalizationFilter( result );
        result = new ArabicStemFilter( result );

        return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?
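To illustrate the concern, here is a minimal self-contained sketch (not Lucene code, and the stopword is only a plausible example): it assumes, as the PersianAnalyzer comment states, that the shipped stopword list is stored in normalized form, and uses a stand-in for the alef folding that ArabicNormalizationFilter performs.

```java
import java.util.Set;

// Hypothetical demo: why running StopFilter before normalization can
// miss stopwords whose list entries are already normalized.
public class StopOrderDemo {
    // Minimal stand-in for ArabicNormalizationFilter: fold the
    // hamza-carrying alef variants down to the bare alef.
    static String normalize(String s) {
        return s.replace('\u0623', '\u0627')   // alef with hamza above -> alef
                .replace('\u0625', '\u0627')   // alef with hamza below -> alef
                .replace('\u0622', '\u0627');  // alef with madda -> alef
    }

    public static void main(String[] args) {
        // Stopword list stored in normalized form (assumption).
        Set<String> stopwords = Set.of("\u0627\u0630\u0627"); // normalized form
        String rawToken = "\u0625\u0630\u0627"; // same word as it may appear in text

        // StopFilter before normalization: no match, the stopword survives.
        System.out.println(stopwords.contains(rawToken));            // false
        // StopFilter after normalization: the stopword is caught.
        System.out.println(stopwords.contains(normalize(rawToken))); // true
    }
}
```

With the ArabicAnalyzer ordering above, the raw token reaches StopFilter unnormalized and fails to match, so the stopword leaks into the index.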


For comparison, the PersianAnalyzer has:
    TokenStream result = new ArabicLetterTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter(result);
    /* additional persian-specific normalization */
    result = new PersianNormalizationFilter(result);
    /*
     * the order here is important: the stopword list is normalized with the
     * above!
     */
    result = new StopFilter(result, stoptable);

    return result;


Thanks,
        DM
