> I'm suggesting that if I know my input document well and know that it has
> mixed text and that the text is Arabic and one other known language that I
> might want to augment the stop list with stop words appropriate for that
> known language. I think that in this case, stop filter should be after lower
> case filter.

I think this is a good idea?
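A minimal sketch of the ordering point, using plain Java collections instead of Lucene's TokenStream API. The class name, the method names, and the two-word English stop list are all hypothetical; the only assumption taken from the thread is that the stop list holds lowercase entries and that matching is case-sensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterOrderSketch {
    // Hypothetical augmented stop list: lowercase English entries only.
    static final Set<String> STOP_WORDS = Set.of("the", "and");

    // Stand-in for a lowercasing filter; Arabic letters have no case and pass through unchanged.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in for a case-sensitive stop filter.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Stop filter first, then lowercase: capitalised stop words slip through.
    static List<String> stopThenLower(List<String> tokens) {
        return lowerCase(removeStopWords(tokens));
    }

    // Lowercase first, then stop filter: every casing of a stop word is removed.
    static List<String> lowerThenStop(List<String> tokens) {
        return removeStopWords(lowerCase(tokens));
    }

    public static void main(String[] args) {
        // "\u0643\u062A\u0627\u0628" is an Arabic word written with escapes to avoid encoding issues.
        List<String> tokens = List.of("The", "\u0643\u062A\u0627\u0628", "and", "Lucene");
        System.out.println(stopThenLower(tokens));
        System.out.println(lowerThenStop(tokens));
    }
}
```

With the stop filter first, "The" survives because it does not match the lowercase entry "the"; with the lowercase step first, both English stop words are dropped and the Arabic token is untouched in either order.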
> As to lower casing across the board, I also think it is pretty safe. But I
> think there are some edge cases. For example, lowercasing a Greek word in
> all upper case ending in sigma will not produce the same as lower casing the
> same Greek word in all lower case. The Greek word should have a final sigma
> rather than a small sigma. For Greek, using an UpperCaseFilter followed by a
> LowerCaseFilter would handle this case.

or you could use unicode case folding. lowercasing is for display purposes, not search.

> IMHO, this is not an issue for the Arabic or Persian analyzers.
>
> -- DM
>
> On 10/08/2009 09:36 AM, Robert Muir wrote:
>
> DM, i suppose. but this is a tricky subject, what if you have mixed Arabic
> / German or something like that?
>
> for some other languages written in the Latin script, English stopwords
> could be bad :)
>
> I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty safe
> across the board though.
>
> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith...@gmail.com> wrote:
>
>> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>>
>>> Just an addition: The lowercase filter is only for the case of embedded
>>> non-arabic words. And these will not appear in the stop words.
>>
>> I learned something new!
>>
>> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
>> augment the stopwords list with English stop words? And if so, shouldn't the
>> stop filter come after the lower case filter?
>>
>> -- DM
>>
>> -----Original Message-----
>>>> From: Basem Narmok [mailto:nar...@gmail.com]
>>>> Sent: Thursday, October 08, 2009 4:20 PM
>>>> To: java-dev@lucene.apache.org
>>>> Subject: Re: Arabic Analyzer: possible bug
>>>>
>>>> DM, there is no upper/lower cases in Arabic, so don't worry, but the
>>>> stop word list needs some corrections and may miss some common/stop
>>>> Arabic words.
>>>>
>>>> Best,
>>>>
>>>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith<dmsmith...@gmail.com> wrote:
>>>>
>>>>> Robert,
>>>>> Thanks for the info.
>>>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>>>> nonsensical, question:
>>>>> Does the stop word list have every combination of upper/lower case for each
>>>>> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
>>>>> after LowerCaseFilter?
>>>>> -- DM
>>>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>>>
>>>>> DM, this isn't a bug.
>>>>>
>>>>> The arabic stopwords are not normalized.
>>>>>
>>>>> but for persian, i normalized the stopwords. mostly because i did not want
>>>>> to have to create variations with farsi yah versus arabic yah for each one.
>>>>>
>>>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith<dmsmith...@gmail.com> wrote:
>>>>>
>>>>>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>>>>>> Arabic or Farsi, but have some texts to index in those languages.)
>>>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>>> TokenStream result = new ArabicLetterTokenizer( reader );
>>>>>> result = new StopFilter( result, stoptable );
>>>>>> result = new LowerCaseFilter(result);
>>>>>> result = new ArabicNormalizationFilter( result );
>>>>>> result = new ArabicStemFilter( result );
>>>>>>
>>>>>> return result;
>>>>>>
>>>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>>>
>>>>>> As a comparison the PersianAnalyzer has:
>>>>>> TokenStream result = new ArabicLetterTokenizer(reader);
>>>>>> result = new LowerCaseFilter(result);
>>>>>> result = new ArabicNormalizationFilter(result);
>>>>>> /* additional persian-specific normalization */
>>>>>> result = new PersianNormalizationFilter(result);
>>>>>> /*
>>>>>> * the order here is important: the stopword list is normalized with the
>>>>>> * above!
>>>>>> */
>>>>>> result = new StopFilter(result, stoptable);
>>>>>>
>>>>>> return result;
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> DM
>>>>>
>>>>> --
>>>>> Robert Muir
>>>>> rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
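
The Greek sigma edge case raised earlier in the thread can be sketched in plain Java. This assumes lowercasing is applied one character at a time with `Character.toLowerCase`; the class and method names are hypothetical, and the upper-then-lower chain is only a stand-in for the UpperCaseFilter plus LowerCaseFilter idea, not a Lucene API.

```java
public class SigmaSketch {
    // Lowercase each char independently, with no knowledge of word position.
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Uppercase each char independently; final sigma and small sigma both become capital sigma.
    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Uppercase then lowercase, so both spellings collapse to the same form.
    static String upperThenLower(String s) {
        return lowerPerChar(upperPerChar(s));
    }

    public static void main(String[] args) {
        String upper = "\u039F\u0394\u039F\u03A3"; // all upper case, ends in capital sigma
        String lower = "\u03BF\u03B4\u03BF\u03C2"; // all lower case, ends in final sigma
        // Per-char lowercasing alone leaves the two spellings different.
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(lower)));
        // Upper-then-lower maps both to the same string.
        System.out.println(upperThenLower(upper).equals(upperThenLower(lower)));
    }
}
```

The normalized form ends in a small sigma rather than a final sigma, so it is suited to matching at index and query time rather than to display.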