OK, the list is ready (an initial one, as I will continue enhancing it). I will create a JIRA issue and send the patch.
Also, I have some small changes to the normalization (e.g. removing some diacritics, and other changes)

Best,
Basem

On Thu, Oct 8, 2009 at 8:51 PM, Robert Muir <rcm...@gmail.com> wrote:
> Basem, I really appreciate your time if you are able to do this.
>
> Its been my hope that introducing Arabic/Farsi support will create enough
> interest to encourage more qualified people to come and really make things
> nice.
>
> If you don't mind, you can look at
> http://wiki.apache.org/lucene-java/HowToContribute and create a JIRA Issue
> with a patch file to improve our stopwords list.
>
> Otherwise, in my opinion a good list is also acceptable and I will volunteer
> to turn it into a patch :)
>
> On Thu, Oct 8, 2009 at 9:32 AM, Basem Narmok <nar...@gmail.com> wrote:
>>
>> Robert,
>>
>> I will be happy to do so. Currently, I am testing the new Arabic
>> analyzer in 2.9, and also I will prepare a new stop word list. I will
>> provide you with my findings/comments soon.
>>
>> Best,
>>
>> On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir <rcm...@gmail.com> wrote:
>> > Basem, by any chance would you be willing to help improve it for us?
>> >
>> > On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok <nar...@gmail.com> wrote:
>> >>
>> >> DM, there is no upper/lower cases in Arabic, so don't worry, but the
>> >> stop word list needs some corrections and may miss some common/stop
>> >> Arabic words.
>> >>
>> >> Best,
>> >>
>> >> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith...@gmail.com> wrote:
>> >> > Robert,
>> >> > Thanks for the info.
>> >> > As I said, I am illiterate in Arabic. So I have another, perhaps
>> >> > nonsensical, question:
>> >> > Does the stop word list have every combination of upper/lower case
>> >> > for each Arabic word in the list? (i.e. is it fully de-normalized?)
>> >> > Or should it come after LowerCaseFilter?
>> >> > -- DM
>> >> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>> >> >
>> >> > DM, this isn't a bug.
>> >> >
>> >> > The arabic stopwords are not normalized.
>> >> >
>> >> > but for persian, i normalized the stopwords. mostly because i did not
>> >> > want to have to create variations with farsi yah versus arabic yah
>> >> > for each one.
>> >> >
>> >> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't
>> >> >> know Arabic or Farsi, but have some texts to index in those
>> >> >> languages.)
>> >> >> The tokenizer/filter chain for ArabicAnalyzer is:
>> >> >>         TokenStream result = new ArabicLetterTokenizer( reader );
>> >> >>         result = new StopFilter( result, stoptable );
>> >> >>         result = new LowerCaseFilter(result);
>> >> >>         result = new ArabicNormalizationFilter( result );
>> >> >>         result = new ArabicStemFilter( result );
>> >> >>
>> >> >>         return result;
>> >> >>
>> >> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>> >> >>
>> >> >> As a comparison the PersianAnalyzer has:
>> >> >>     TokenStream result = new ArabicLetterTokenizer(reader);
>> >> >>     result = new LowerCaseFilter(result);
>> >> >>     result = new ArabicNormalizationFilter(result);
>> >> >>     /* additional persian-specific normalization */
>> >> >>     result = new PersianNormalizationFilter(result);
>> >> >>     /*
>> >> >>      * the order here is important: the stopword list is normalized
>> >> >>      * with the above!
>> >> >>      */
>> >> >>     result = new StopFilter(result, stoptable);
>> >> >>
>> >> >>     return result;
>> >> >>
>> >> >>
>> >> >> Thanks,
>> >> >>       DM
>> >> >
>> >> >
>> >> > --
>> >> > Robert Muir
>> >> > rcm...@gmail.com
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>> >>
>> >
>> >
>> > --
>> > Robert Muir
>> > rcm...@gmail.com
>> >
>>
>
>
> --
> Robert Muir
> rcm...@gmail.com
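
For anyone following the ordering question in this thread, here is a minimal, self-contained sketch in plain Java (no Lucene classes) of why the stop word list has to be in the same form as the tokens at the point where the stop filter runs. Everything in it is an assumption made for illustration: the class StopOrderDemo, its methods, and the toy normalize() (which only folds Farsi yeh into Arabic yeh and drops a small range of diacritics) are hypothetical and do not reproduce what ArabicNormalizationFilter or PersianNormalizationFilter actually do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopOrderDemo {

    // Toy normalizer (hypothetical, much simpler than the Lucene filters):
    // folds Farsi yeh (U+06CC) into Arabic yeh (U+064A) and drops the
    // diacritics in the range U+064B..U+0652.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c >= '\u064B' && c <= '\u0652') {
                continue; // skip diacritic
            }
            sb.append(c == '\u06CC' ? '\u064A' : c);
        }
        return sb.toString();
    }

    // Stop filtering first, normalization second (the ArabicAnalyzer order
    // quoted above): tokens are compared to the list in their raw form.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Normalization first, stop filtering second (the PersianAnalyzer order
    // quoted above): tokens are compared to the list in normalized form.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stopWords.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A stop word typed with Farsi yeh, followed by an ordinary word.
        List<String> tokens = Arrays.asList("\u0641\u06CC", "\u0643\u062A\u0627\u0628");
        // Stop list holding only the Arabic yeh spelling of the stop word.
        Set<String> stopWords = new HashSet<>(Arrays.asList("\u0641\u064A"));

        System.out.println("stop first:      "
                + stopThenNormalize(tokens, stopWords).size() + " tokens kept");
        System.out.println("normalize first: "
                + normalizeThenStop(tokens, stopWords).size() + " tokens kept");
    }
}
```

With this input the stop-first order keeps both tokens, because the raw Farsi yeh spelling is not in the list, while the normalize-first order keeps only one. That is the trade-off described above: a stop filter placed before normalization needs every written variant in the list, whereas a stop filter placed after normalization needs only the normalized spelling, provided the list itself was normalized the same way.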