I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't
know Arabic or Farsi, but have some texts to index in those languages.)
The tokenizer/filter chain for ArabicAnalyzer is:
TokenStream result = new ArabicLetterTokenizer(reader);
result = new StopFilter(result, stoptable);
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter(result);
result = new ArabicStemFilter(result);
return result;
Shouldn't the StopFilter come after ArabicNormalizationFilter?
As a comparison, the PersianAnalyzer has:
TokenStream result = new ArabicLetterTokenizer(reader);
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter(result);
/* additional persian-specific normalization */
result = new PersianNormalizationFilter(result);
/*
 * the order here is important: the stopword list is normalized with
 * the above!
 */
result = new StopFilter(result, stoptable);
return result;
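For what it's worth, here is a sketch of what I'd expect the Arabic chain to look like if it followed the same ordering as the Persian one (same classes, just reordered; I'm assuming the bundled Arabic stopword list would also need to be stored in normalized form, which I haven't verified):

TokenStream result = new ArabicLetterTokenizer(reader);
result = new LowerCaseFilter(result);
result = new ArabicNormalizationFilter(result);
/* assumes the Arabic stopword list is normalized the same way */
result = new StopFilter(result, stoptable);
result = new ArabicStemFilter(result);
return result;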
Thanks,
DM