DM, I suppose. But this is a tricky subject: what if you have mixed Arabic / German or something like that?
For some other languages written in the Latin script, English stopwords could be bad :) I think that lowercasing non-Arabic (also Cyrillic, etc.) is pretty safe across the board, though.

On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith...@gmail.com> wrote:

> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>
>> Just an addition: The lowercase filter is only for the case of embedded
>> non-arabic words. And these will not appear in the stop words.
>
> I learned something new!
>
> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
> augment the stopwords list with English stop words? And if so, shouldn't the
> stop filter come after the lower case filter?
>
> -- DM
>
>>> -----Original Message-----
>>> From: Basem Narmok [mailto:nar...@gmail.com]
>>> Sent: Thursday, October 08, 2009 4:20 PM
>>> To: java-dev@lucene.apache.org
>>> Subject: Re: Arabic Analyzer: possible bug
>>>
>>> DM, there is no upper/lower cases in Arabic, so don't worry, but the
>>> stop word list needs some corrections and may miss some common/stop
>>> Arabic words.
>>>
>>> Best,
>>>
>>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith<dmsmith...@gmail.com> wrote:
>>>
>>>> Robert,
>>>> Thanks for the info.
>>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>>> nonsensical, question:
>>>> Does the stop word list have every combination of upper/lower case for each
>>>> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
>>>> after LowerCaseFilter?
>>>> -- DM
>>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>>
>>>> DM, this isn't a bug.
>>>>
>>>> The arabic stopwords are not normalized.
>>>>
>>>> but for persian, i normalized the stopwords. mostly because i did not want
>>>> to have to create variations with farsi yah versus arabic yah for each one.
>>>>
>>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith<dmsmith...@gmail.com> wrote:
>>>>
>>>>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>>>>> Arabic or Farsi, but have some texts to index in those languages.)
>>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>>        TokenStream result = new ArabicLetterTokenizer( reader );
>>>>>        result = new StopFilter( result, stoptable );
>>>>>        result = new LowerCaseFilter(result);
>>>>>        result = new ArabicNormalizationFilter( result );
>>>>>        result = new ArabicStemFilter( result );
>>>>>
>>>>>        return result;
>>>>>
>>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>>
>>>>> As a comparison the PersianAnalyzer has:
>>>>>        TokenStream result = new ArabicLetterTokenizer(reader);
>>>>>        result = new LowerCaseFilter(result);
>>>>>        result = new ArabicNormalizationFilter(result);
>>>>>        /* additional persian-specific normalization */
>>>>>        result = new PersianNormalizationFilter(result);
>>>>>        /*
>>>>>         * the order here is important: the stopword list is normalized with the
>>>>>         * above!
>>>>>         */
>>>>>        result = new StopFilter(result, stoptable);
>>>>>
>>>>>        return result;
>>>>>
>>>>> Thanks,
>>>>>      DM
>>>>
>>>> --
>>>> Robert Muir
>>>> rcm...@gmail.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org

--
Robert Muir
rcm...@gmail.com
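
To make the ordering question in this thread concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: FilterOrderSketch, STOP_WORDS and the helper methods are made up for illustration. It assumes the stop set is matched case-sensitively and has been augmented with lowercase English entries, which is the scenario DM raises, not something the shipped Arabic stop list does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterOrderSketch {

    // Hypothetical stop set: lowercase English entries next to one Arabic word
    // (U+0645 U+0646). Not the real ArabicAnalyzer stop list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "\u0645\u0646"));

    // Drops tokens that match a stop word exactly (case-sensitive, by assumption).
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Lowercases every token; Arabic letters have no case and pass through unchanged.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Same relative order as the quoted ArabicAnalyzer chain: stop, then lowercase.
    static List<String> stopThenLowercase(List<String> tokens) {
        return lowercase(removeStopWords(tokens));
    }

    // The order DM asks about: lowercase, then stop.
    static List<String> lowercaseThenStop(List<String> tokens) {
        return removeStopWords(lowercase(tokens));
    }

    public static void main(String[] args) {
        // Mixed text: a capitalized English stop word, an Arabic content word,
        // an Arabic stop word, and an English content word.
        List<String> tokens = Arrays.asList(
                "The", "\u0643\u062A\u0627\u0628", "\u0645\u0646", "Lucene");
        System.out.println(stopThenLowercase(tokens));
        System.out.println(lowercaseThenStop(tokens));
    }
}
```

With stop-then-lowercase, the capitalized "The" misses the lowercase stop entry and is only lowercased afterwards, so it stays in the stream. With lowercase-then-stop it is removed. The Arabic stop word is dropped in both orders, which matches Basem's point that case does not matter for the Arabic entries themselves.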