There is no upper and lower case in Arabic.

--- On Thu, 10/8/09, DM Smith <dmsmith...@gmail.com> wrote:

From: DM Smith <dmsmith...@gmail.com>
Subject: Re: Arabic Analyzer: possible bug
To: java-dev@lucene.apache.org
Date: Thursday, October 8, 2009, 3:14 PM

Robert,

Thanks for the info.

As I said, I am illiterate in Arabic. So I have another, perhaps nonsensical, question:

Does the stop word list have every combination of upper/lower case for each Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come after LowerCaseFilter?

-- DM

On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

DM, this isn't a bug. The arabic stopwords are not normalized.

but for persian, i normalized the stopwords. mostly because i did not want to have to create variations with farsi yah versus arabic yah for each one.

On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:

I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know Arabic or Farsi, but have some texts to index in those languages.)

The tokenizer/filter chain for ArabicAnalyzer is:

    TokenStream result = new ArabicLetterTokenizer( reader );
    result = new StopFilter( result, stoptable );
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter( result );
    result = new ArabicStemFilter( result );
    return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?

As a comparison the PersianAnalyzer has:

    TokenStream result = new ArabicLetterTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter(result);
    /* additional persian-specific normalization */
    result = new PersianNormalizationFilter(result);
    /*
     * the order here is important: the stopword list is normalized with the
     * above!
     */
    result = new StopFilter(result, stoptable);
    return result;

Thanks,
DM

--
Robert Muir
rcm...@gmail.com
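To make the ordering point in this thread concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is an assumption for illustration: the class StopOrderSketch, the one-word stop list, and the toy normalize() method (which only folds alef-with-hamza-below to bare alef and alef maksura to yeh) are hypothetical stand-ins, not the actual StopFilter or ArabicNormalizationFilter code. It shows that an un-normalized stop list only matches if stop filtering runs before normalization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopOrderSketch {
    // Hypothetical stop list with ONE un-normalized entry:
    // U+0625 U+0644 U+0649 (alef with hamza below, lam, alef maksura).
    static final Set<String> RAW_STOPS =
            new HashSet<>(Arrays.asList("\u0625\u0644\u0649"));

    // Toy normalizer (an assumption, not Lucene's implementation):
    // alef with hamza below -> bare alef, alef maksura -> yeh.
    static String normalize(String token) {
        return token.replace('\u0625', '\u0627').replace('\u0649', '\u064A');
    }

    // Stop filtering first, then normalization (the ArabicAnalyzer order).
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stops) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stops.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Normalization first, then stop filtering (the PersianAnalyzer order).
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stops) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stops.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two tokens: the stop word above, plus U+0643 U+062A U+0627 U+0628.
        List<String> tokens = Arrays.asList(
                "\u0625\u0644\u0649", "\u0643\u062A\u0627\u0628");

        // The raw stop word matches the raw list, so it is dropped.
        System.out.println(stopThenNormalize(tokens, RAW_STOPS).size()); // prints 1

        // After normalization the token no longer equals the raw list entry,
        // so the stop word slips through.
        System.out.println(normalizeThenStop(tokens, RAW_STOPS).size()); // prints 2
    }
}
```

With a stop list that has itself been passed through normalize(), the second order would be the one that works, which is the situation described for the Persian analyzer.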