Robert,

I will be happy to do so. I am currently testing the new Arabic analyzer in 2.9, and I will also prepare a new stop word list. I will send you my findings and comments soon.

Best,

On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir <rcm...@gmail.com> wrote:
> Basem, by any chance would you be willing to help improve it for us?
>
> On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok <nar...@gmail.com> wrote:
>>
>> DM, there is no upper/lower cases in Arabic, so don't worry, but the
>> stop word list needs some corrections and may miss some common/stop
>> Arabic words.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith...@gmail.com> wrote:
>> > Robert,
>> > Thanks for the info.
>> > As I said, I am illiterate in Arabic. So I have another, perhaps
>> > nonsensical, question:
>> > Does the stop word list have every combination of upper/lower case for each
>> > Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
>> > after LowerCaseFilter?
>> > -- DM
>> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>> >
>> > DM, this isn't a bug.
>> >
>> > The arabic stopwords are not normalized.
>> >
>> > but for persian, i normalized the stopwords. mostly because i did not want
>> > to have to create variations with farsi yah versus arabic yah for each one.
>> >
>> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:
>> >>
>> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>> >> Arabic or Farsi, but have some texts to index in those languages.)
>> >> The tokenizer/filter chain for ArabicAnalyzer is:
>> >>     TokenStream result = new ArabicLetterTokenizer( reader );
>> >>     result = new StopFilter( result, stoptable );
>> >>     result = new LowerCaseFilter(result);
>> >>     result = new ArabicNormalizationFilter( result );
>> >>     result = new ArabicStemFilter( result );
>> >>
>> >>     return result;
>> >>
>> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>> >>
>> >> As a comparison the PersianAnalyzer has:
>> >>     TokenStream result = new ArabicLetterTokenizer(reader);
>> >>     result = new LowerCaseFilter(result);
>> >>     result = new ArabicNormalizationFilter(result);
>> >>     /* additional persian-specific normalization */
>> >>     result = new PersianNormalizationFilter(result);
>> >>     /*
>> >>      * the order here is important: the stopword list is normalized with
>> >>      * the above!
>> >>      */
>> >>     result = new StopFilter(result, stoptable);
>> >>
>> >>     return result;
>> >>
>> >>
>> >> Thanks,
>> >> DM
>> >
>> >
>> > --
>> > Robert Muir
>> > rcm...@gmail.com
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
>
> --
> Robert Muir
> rcm...@gmail.com
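
To make the ordering question in the thread concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: `normalize` is a simplified stand-in written for this illustration (it is not Lucene's ArabicNormalizationFilter), and the class name, method names, and the single stop word are all assumptions chosen for the example. It only shows the general point that a stop list and the token stream must be in the same form at the moment they are compared, and it makes no claim about how the stop list shipped with 2.9 behaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Lucene.
class StopOrderSketch {

    // Simplified stand-in for Arabic normalization (an assumption, not Lucene's code):
    // drops tatweel and short-vowel marks, folds alef variants to bare alef,
    // and folds alef maksura to yeh.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (c == '\u0640' || (c >= '\u064B' && c <= '\u0652')) {
                continue; // tatweel or diacritic: remove
            }
            if (c == '\u0622' || c == '\u0623' || c == '\u0625') {
                c = '\u0627'; // alef with madda/hamza -> bare alef
            } else if (c == '\u0649') {
                c = '\u064A'; // alef maksura -> yeh
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Order A: compare raw tokens against the raw stop list, then normalize.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> rawStops) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            if (!rawStops.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Order B: normalize the tokens and the stop list, then compare.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> rawStops) {
        Set<String> stops = new HashSet<String>();
        for (String s : rawStops) {
            stops.add(normalize(s));
        }
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stops.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stop list holds the word written with alef-hamza-below.
        Set<String> rawStops = new HashSet<String>(Arrays.asList("\u0625\u0644\u0649"));
        // The text spells the same word with a bare alef, followed by one content word.
        List<String> tokens = Arrays.asList("\u0627\u0644\u0649", "\u0643\u062A\u0627\u0628");

        System.out.println(stopThenNormalize(tokens, rawStops).size()); // 2: variant slips through
        System.out.println(normalizeThenStop(tokens, rawStops).size()); // 1: variant is removed
    }
}
```

With order A the stop list has to enumerate every spelling variant it wants to catch; with order B the list only needs the normalized form, which matches the reason Robert gives above for normalizing the Persian list.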