Basem, I really appreciate your time if you are able to do this. It's been my hope that introducing Arabic/Farsi support will create enough interest to encourage more qualified people to come and really make things nice.
If you don't mind, you can look at http://wiki.apache.org/lucene-java/HowToContribute and create a JIRA Issue with a patch file to improve our stopwords list. Otherwise, in my opinion a good list is also acceptable and I will volunteer to turn it into a patch :)

On Thu, Oct 8, 2009 at 9:32 AM, Basem Narmok <nar...@gmail.com> wrote:

> Robert,
>
> I will be happy to do so. Currently, I am testing the new Arabic
> analyzer in 2.9, and also I will prepare a new stop word list. I will
> provide you with my findings/comments soon.
>
> Best,
>
> On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir <rcm...@gmail.com> wrote:
> > Basem, by any chance would you be willing to help improve it for us?
> >
> > On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok <nar...@gmail.com> wrote:
> >>
> >> DM, there is no upper/lower cases in Arabic, so don't worry, but the
> >> stop word list needs some corrections and may miss some common/stop
> >> Arabic words.
> >>
> >> Best,
> >>
> >> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith...@gmail.com> wrote:
> >> > Robert,
> >> > Thanks for the info.
> >> > As I said, I am illiterate in Arabic. So I have another, perhaps
> >> > nonsensical, question:
> >> > Does the stop word list have every combination of upper/lower case
> >> > for each Arabic word in the list? (i.e. is it fully de-normalized?)
> >> > Or should it come after LowerCaseFilter?
> >> > -- DM
> >> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
> >> >
> >> > DM, this isn't a bug.
> >> >
> >> > The arabic stopwords are not normalized.
> >> >
> >> > but for persian, i normalized the stopwords. mostly because i did
> >> > not want to have to create variations with farsi yah versus arabic
> >> > yah for each one.
> >> >
> >> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:
> >> >>
> >> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't
> >> >> know Arabic or Farsi, but have some texts to index in those
> >> >> languages.)
> >> >> The tokenizer/filter chain for ArabicAnalyzer is:
> >> >>     TokenStream result = new ArabicLetterTokenizer( reader );
> >> >>     result = new StopFilter( result, stoptable );
> >> >>     result = new LowerCaseFilter(result);
> >> >>     result = new ArabicNormalizationFilter( result );
> >> >>     result = new ArabicStemFilter( result );
> >> >>
> >> >>     return result;
> >> >>
> >> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
> >> >>
> >> >> As a comparison the PersianAnalyzer has:
> >> >>     TokenStream result = new ArabicLetterTokenizer(reader);
> >> >>     result = new LowerCaseFilter(result);
> >> >>     result = new ArabicNormalizationFilter(result);
> >> >>     /* additional persian-specific normalization */
> >> >>     result = new PersianNormalizationFilter(result);
> >> >>     /*
> >> >>      * the order here is important: the stopword list is
> >> >>      * normalized with the above!
> >> >>      */
> >> >>     result = new StopFilter(result, stoptable);
> >> >>
> >> >>     return result;
> >> >>
> >> >>
> >> >> Thanks,
> >> >> DM
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > rcm...@gmail.com
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >>
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>

--
Robert Muir
rcm...@gmail.com
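
For anyone following the ordering question in the quoted thread, here is a minimal sketch of why a StopFilter has to sit on the side of the normalizer that matches the form of its stop list. It does not use Lucene at all. The class FilterOrderSketch, its method names, and the one-rule normalize() (folding Farsi yeh U+06CC to Arabic yeh U+064A) are hypothetical stand-ins chosen for illustration, not the real ArabicNormalizationFilter or PersianNormalizationFilter. The stop list here is assumed to be stored in normalized form, as described for the Persian list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterOrderSketch {
    // Hypothetical stop list with one entry, stored in NORMALIZED form:
    // alef + Arabic yeh (U+064A) + noon.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("\u0627\u064A\u0646"));

    // Simplified stand-in for a normalization filter (an assumption, not
    // Lucene's implementation): fold Farsi yeh (U+06CC) to Arabic yeh (U+064A).
    static String normalize(String token) {
        return token.replace('\u06CC', '\u064A');
    }

    // Stop filtering BEFORE normalization: raw tokens are compared against
    // the normalized stop list, so a Farsi-yeh spelling is not recognized.
    static List<String> stopThenNormalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Stop filtering AFTER normalization: tokens are folded first, so they
    // are compared in the same form the stop list is stored in.
    static List<String> normalizeThenStop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!STOP_WORDS.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // First token is the stop word spelled with Farsi yeh (U+06CC);
        // second token is an ordinary word that should be kept.
        List<String> tokens =
                Arrays.asList("\u0627\u06CC\u0646", "\u06A9\u062A\u0627\u0628");
        System.out.println(stopThenNormalize(tokens).size()); // prints 2
        System.out.println(normalizeThenStop(tokens).size()); // prints 1
    }
}
```

With a stop list kept in raw (un-normalized) form, the situation is reversed and filtering before normalization is the consistent choice, which matches the explanation given for the Arabic chain in the thread.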