DM, this isn't a bug: the Arabic stopwords are not normalized.
But for Persian, I normalized the stopwords, mostly because I did not want
to have to create variations with Farsi yah versus Arabic yah for each one.

On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:
> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
> Arabic or Farsi, but have some texts to index in those languages.)
>
> The tokenizer/filter chain for ArabicAnalyzer is:
>
>     TokenStream result = new ArabicLetterTokenizer( reader );
>     result = new StopFilter( result, stoptable );
>     result = new LowerCaseFilter(result);
>     result = new ArabicNormalizationFilter( result );
>     result = new ArabicStemFilter( result );
>
>     return result;
>
> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>
> As a comparison, the PersianAnalyzer has:
>
>     TokenStream result = new ArabicLetterTokenizer(reader);
>     result = new LowerCaseFilter(result);
>     result = new ArabicNormalizationFilter(result);
>     /* additional persian-specific normalization */
>     result = new PersianNormalizationFilter(result);
>     /*
>      * the order here is important: the stopword list is normalized with
>      * the above!
>      */
>     result = new StopFilter(result, stoptable);
>
>     return result;
>
> Thanks,
> DM

--
Robert Muir
rcm...@gmail.com
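To make the ordering point concrete, here is a standalone sketch (not Lucene code; the class name, the toy normalize() method, and the sample tokens are made up for illustration). It folds Farsi yeh (U+06CC) to Arabic yeh (U+064A), the way PersianNormalizationFilter does among other things, and shows that a stopword list stored in normalized form only matches if stop filtering runs after normalization:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopOrderDemo {

    // Toy stand-in for PersianNormalizationFilter: fold Farsi yeh
    // (U+06CC) to Arabic yeh (U+064A).
    static String normalize(String token) {
        return token.replace('\u06CC', '\u064A');
    }

    // Toy stand-in for StopFilter: drop tokens found in the stop table.
    static List<String> removeStopwords(List<String> tokens, Set<String> stoptable) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stoptable.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Wrong order for a normalized stop table: stop filter first, then
    // normalization. A token written with Farsi yeh never matches the
    // Arabic-yeh entry in the table, so the stopword slips through.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stoptable) {
        List<String> kept = removeStopwords(tokens, stoptable);
        List<String> out = new ArrayList<>();
        for (String t : kept) {
            out.add(normalize(t));
        }
        return out;
    }

    // Right order: normalize first, then stop filter. The folded token
    // now matches the normalized stop table and is removed.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stoptable) {
        List<String> normalized = new ArrayList<>();
        for (String t : tokens) {
            normalized.add(normalize(t));
        }
        return removeStopwords(normalized, stoptable);
    }

    public static void main(String[] args) {
        // Stop table holds the normalized spelling (Arabic yeh, U+064A).
        Set<String> stoptable = new HashSet<>(Arrays.asList("\u0645\u064A"));
        // Input token uses the Farsi yeh spelling (U+06CC).
        List<String> tokens = Arrays.asList("\u0645\u06CC", "text");

        // Stopword survives: [\u0645\u064A, text]
        System.out.println(stopThenNormalize(tokens, stoptable));
        // Stopword removed: [text]
        System.out.println(normalizeThenStop(tokens, stoptable));
    }
}
```

This is exactly why the comment in PersianAnalyzer says the order matters: its stop table is stored pre-normalized, so StopFilter has to come after the normalization filters, while ArabicAnalyzer's unnormalized stop table works with StopFilter first.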