DM, by the way, if you want this lowercasing behavior with edge cases, check out LUCENE-1488. There is a case folding filter there, as well as a normalization filter, and they interact correctly for what you want :)

It's my understanding that contrib/analyzers should not have any external
dependencies, and it could be eons before the JDK exposes these things, so I
don't know what to do. It would be nice if things like ArabicAnalyzer handled
Greek edge cases correctly, don't you think?

On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir <rcm...@gmail.com> wrote:

>> I'm suggesting that if I know my input document well and know that it has
>> mixed text and that the text is Arabic and one other known language that I
>> might want to augment the stop list with stop words appropriate for that
>> known language. I think that in this case, stop filter should be after lower
>> case filter.
>>
> I think this is a good idea?
>
>> As to lower casing across the board, I also think it is pretty safe. But I
>> think there are some edge cases. For example, lowercasing a Greek word in
>> all upper case ending in sigma will not produce the same as lower casing the
>> same Greek word in all lower case. The Greek word should have a final sigma
>> rather than a small sigma. For Greek, using an UpperCaseFilter followed by a
>> LowerCaseFilter would handle this case.
>>
> or you could use unicode case folding. lowercasing is for display purposes,
> not search.
>
>> IMHO, this is not an issue for the Arabic or Persian analyzers.
>>
>> -- DM
>>
>> On 10/08/2009 09:36 AM, Robert Muir wrote:
>>
>> DM, i suppose. but this is a tricky subject, what if you have mixed Arabic
>> / German or something like that?
>>
>> for some other languages written in the Latin script, English stopwords
>> could be bad :)
>>
>> I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty safe
>> across the board though.
>>
>> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith...@gmail.com> wrote:
>>
>>> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>>>
>>>> Just an addition: The lowercase filter is only for the case of embedded
>>>> non-arabic words. And these will not appear in the stop words.
>>>>
>>> I learned something new!
>>>
>>> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
>>> augment the stopwords list with English stop words? And if so, shouldn't the
>>> stop filter come after the lower case filter?
>>>
>>> -- DM
>>>
>>>> -----Original Message-----
>>>>> From: Basem Narmok [mailto:nar...@gmail.com]
>>>>> Sent: Thursday, October 08, 2009 4:20 PM
>>>>> To: java-dev@lucene.apache.org
>>>>> Subject: Re: Arabic Analyzer: possible bug
>>>>>
>>>>> DM, there is no upper/lower cases in Arabic, so don't worry, but the
>>>>> stop word list needs some corrections and may miss some common/stop
>>>>> Arabic words.
>>>>>
>>>>> Best,
>>>>>
>>>>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith<dmsmith...@gmail.com> wrote:
>>>>>
>>>>>> Robert,
>>>>>> Thanks for the info.
>>>>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>>>>> nonsensical, question:
>>>>>> Does the stop word list have every combination of upper/lower case for each
>>>>>> Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come
>>>>>> after LowerCaseFilter?
>>>>>> -- DM
>>>>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>>>>
>>>>>> DM, this isn't a bug.
>>>>>>
>>>>>> The arabic stopwords are not normalized.
>>>>>>
>>>>>> but for persian, i normalized the stopwords. mostly because i did not want
>>>>>> to have to create variations with farsi yah versus arabic yah for each one.
>>>>>>
>>>>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith<dmsmith...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
>>>>>>> Arabic or Farsi, but have some texts to index in those languages.)
>>>>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>>>>     TokenStream result = new ArabicLetterTokenizer( reader );
>>>>>>>     result = new StopFilter( result, stoptable );
>>>>>>>     result = new LowerCaseFilter(result);
>>>>>>>     result = new ArabicNormalizationFilter( result );
>>>>>>>     result = new ArabicStemFilter( result );
>>>>>>>
>>>>>>>     return result;
>>>>>>>
>>>>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>>>>
>>>>>>> As a comparison the PersianAnalyzer has:
>>>>>>>     TokenStream result = new ArabicLetterTokenizer(reader);
>>>>>>>     result = new LowerCaseFilter(result);
>>>>>>>     result = new ArabicNormalizationFilter(result);
>>>>>>>     /* additional persian-specific normalization */
>>>>>>>     result = new PersianNormalizationFilter(result);
>>>>>>>     /*
>>>>>>>      * the order here is important: the stopword list is normalized with the
>>>>>>>      * above!
>>>>>>>      */
>>>>>>>     result = new StopFilter(result, stoptable);
>>>>>>>
>>>>>>>     return result;
>>>>>>>
>>>>>>> Thanks,
>>>>>>> DM
>>>>>>
>>>>>> --
>>>>>> Robert Muir
>>>>>> rcm...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
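
For anyone following the Greek sigma point in this thread, here is a minimal sketch in plain Java with no Lucene dependency. It assumes a lowercase filter that maps one character at a time with Character.toLowerCase; the class and method names (SigmaFolding, lowerPerChar, upperThenLower) are made up for illustration and are not part of Lucene or LUCENE-1488. The upper-then-lower trick shown is only a rough stand-in for real Unicode case folding.

```java
// Hypothetical illustration only; not Lucene code.
final class SigmaFolding {

    // Lowercase each char independently, with no knowledge of its position in the word.
    // Assumption: this approximates what a simple per-character lowercase filter does.
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Uppercase each char, then lowercase it again, so that final sigma (U+03C2)
    // and small sigma (U+03C3) both end up as small sigma.
    static String upperThenLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(Character.toUpperCase(s.charAt(i))));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String allCaps = "\u039F\u0394\u039F\u03A3"; // Greek word in capitals, ends in capital sigma
        String allLower = "\u03BF\u03B4\u03BF\u03C2"; // same word in lowercase, ends in final sigma

        // Per-char lowercasing leaves the two spellings different (small vs final sigma).
        System.out.println(lowerPerChar(allCaps).equals(lowerPerChar(allLower)));     // prints false

        // Upper-then-lower maps both spellings to the same string.
        System.out.println(upperThenLower(allCaps).equals(upperThenLower(allLower))); // prints true
    }
}
```

With per-character lowercasing alone, a query typed in capitals would not match the same word indexed from lowercase text, which is the mismatch described above. Folding both sides to one form removes it, at the cost of no longer being suitable for display.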