Re: Arabic Analyzer: possible bug

DM Smith Thu, 08 Oct 2009 06:30:31 -0700

On 10/08/2009 09:23 AM, Uwe Schindler wrote:

Just an addition: The lowercase filter is only for the case of embedded
non-Arabic words. And these will not appear in the stop words.

I learned something new!

Hmm. If one has a mixed Arabic / English text, shouldn't one be able to augment the stopwords list with English stop words? And if so, shouldn't the stop filter come after the lower case filter?
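
To make the ordering question concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the class, the method names, and the three-word stop set are hypothetical stand-ins, and it assumes the stop set is matched case-sensitively. It only shows what happens to a capitalized English stop word under each order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java sketch, not the Lucene API: each method mimics one filter order.
final class StopOrderSketch {
    // Hypothetical mixed stop set: lowercase English words plus one Arabic word
    // (feh + yeh, written with escapes to avoid source-encoding issues).
    static final Set<String> STOP =
            new HashSet<>(Arrays.asList("the", "of", "\u0641\u064A"));

    // Stop filtering first, lowercasing second (the order quoted for ArabicAnalyzer).
    static List<String> stopThenLower(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP.contains(t)) {            // case-sensitive lookup on the raw token
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Lowercasing first, stop filtering second.
    static List<String> lowerThenStop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!STOP.contains(lower)) {        // lookup on the lowercased token
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "index", "\u0641\u064A", "Lucene");
        System.out.println(stopThenLower(tokens).size()); // 3: "The" slips past the stop set
        System.out.println(lowerThenStop(tokens).size()); // 2: "the" is removed
    }
}
```

In this sketch the Arabic stop word is removed under both orders, since lowercasing leaves it unchanged; only the capitalized English word is affected.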


-- DM

-----Original Message-----
From: Basem Narmok [mailto:[email protected]]
Sent: Thursday, October 08, 2009 4:20 PM
To: [email protected]
Subject: Re: Arabic Analyzer: possible bug

DM, there are no upper/lower cases in Arabic, so don't worry, but the
stop word list needs some corrections and may miss some common/stop
Arabic words.

Best,

On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <[email protected]> wrote:

Robert,
Thanks for the info.
As I said, I am illiterate in Arabic. So I have another, perhaps
nonsensical, question:
Does the stop word list have every combination of upper/lower case for
each Arabic word in the list? (i.e. is it fully de-normalized?) Or should
it come after LowerCaseFilter?
-- DM
On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

DM, this isn't a bug.

The arabic stopwords are not normalized.

but for persian, i normalized the stopwords. mostly because i did not
want to have to create variations with farsi yah versus arabic yah for
each one.

On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <[email protected]> wrote:

I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
Arabic or Farsi, but have some texts to index in those languages.)
The tokenizer/filter chain for ArabicAnalyzer is:
         TokenStream result = new ArabicLetterTokenizer( reader );
         result = new StopFilter( result, stoptable );
         result = new LowerCaseFilter(result);
         result = new ArabicNormalizationFilter( result );
         result = new ArabicStemFilter( result );

         return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?

As a comparison the PersianAnalyzer has:
     TokenStream result = new ArabicLetterTokenizer(reader);
     result = new LowerCaseFilter(result);
     result = new ArabicNormalizationFilter(result);
     /* additional persian-specific normalization */
     result = new PersianNormalizationFilter(result);
     /*
      * the order here is important: the stopword list is normalized with the
      * above!
      */
     result = new StopFilter(result, stoptable);

     return result;
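
The point of that comment can be sketched in plain Java without the Lucene API. The class and method names are hypothetical, and the single character mapping (Farsi yeh U+06CC to Arabic yeh U+064A) is only a stand-in for whatever the real normalization filters do. If the stop list is stored in normalized form and lookups happen after normalization, one entry covers both spellings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch, not the Lucene API: normalize first, then match stop words.
final class NormalizedStopSketch {
    static final char FARSI_YEH = '\u06CC';
    static final char ARABIC_YEH = '\u064A';

    // Stand-in normalization (an assumption): fold Farsi yeh into Arabic yeh.
    static String normalize(String token) {
        return token.replace(FARSI_YEH, ARABIC_YEH);
    }

    // Store the stop list in normalized form, so it matches normalized tokens.
    static Set<String> normalizedStopSet(List<String> rawStopWords) {
        Set<String> stop = new HashSet<>();
        for (String w : rawStopWords) {
            stop.add(normalize(w));
        }
        return stop;
    }

    // Stop check applied after normalization, as in the PersianAnalyzer order.
    static boolean isStop(Set<String> stop, String token) {
        return stop.contains(normalize(token));
    }

    public static void main(String[] args) {
        String withFarsiYeh = "\u0627\u06CC\u0646";  // alef, Farsi yeh, noon
        String withArabicYeh = "\u0627\u064A\u0646"; // alef, Arabic yeh, noon
        Set<String> stop = normalizedStopSet(Arrays.asList(withFarsiYeh));

        System.out.println(isStop(stop, withFarsiYeh));  // true
        System.out.println(isStop(stop, withArabicYeh)); // true
        // A lookup on the raw, un-normalized token misses the normalized entry.
        System.out.println(stop.contains(withFarsiYeh)); // false
    }
}
```

The last line shows the flip side: once the list is normalized, filtering raw tokens before normalization would miss entries, which is why the order matters.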


Thanks,
DM


--
Robert Muir
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
