Re: Arabic Analyzer: possible bug

DM Smith Thu, 08 Oct 2009 06:15:35 -0700

Robert,
Thanks for the info.

As I said, I am illiterate in Arabic. So I have another, perhaps nonsensical, question: Does the stop word list have every combination of upper/lower case for each Arabic word in the list? (i.e. is it fully de-normalized?) Or should it come after LowerCaseFilter?


-- DM

On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

DM, this isn't a bug.

The arabic stopwords are not normalized.

but for persian, i normalized the stopwords. mostly because i did not want to have to create variations with farsi yah versus arabic yah for each one.


On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <[email protected]> wrote:

I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know Arabic or Farsi, but have some texts to index in those languages.)


The tokenizer/filter chain for ArabicAnalyzer is:
        TokenStream result = new ArabicLetterTokenizer( reader );
        result = new StopFilter( result, stoptable );
        result = new LowerCaseFilter(result);
        result = new ArabicNormalizationFilter( result );
        result = new ArabicStemFilter( result );

        return result;

Shouldn't the StopFilter come after ArabicNormalizationFilter?


As a comparison the PersianAnalyzer has:
    TokenStream result = new ArabicLetterTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter(result);
    /* additional persian-specific normalization */
    result = new PersianNormalizationFilter(result);
    /*
     * the order here is important: the stopword list is normalized with the
     * above!
     */
    result = new StopFilter(result, stoptable);

    return result;


Thanks,
        DM



--
Robert Muir
[email protected]
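
For readers following the thread, the ordering point can be illustrated with a small standalone sketch. It does not use Lucene at all: `StopOrderSketch`, `normalize`, `stopThenNormalize`, and `normalizeThenStop` are hypothetical names invented for this illustration, and the "normalization" is reduced to a single assumed rule (folding Farsi yeh, U+06CC, to Arabic yeh, U+064A). The stop list here holds only the normalized spelling, which is the situation Robert describes for the Persian list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopOrderSketch {

    // Hypothetical stand-in for a normalization filter:
    // fold Farsi yeh (U+06CC) to Arabic yeh (U+064A). Nothing else is normalized.
    static String normalize(String token) {
        return token.replace('\u06CC', '\u064A');
    }

    // Stop filtering first, normalization second.
    // Tokens are compared against the stop list in their raw spelling.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    // Normalization first, stop filtering second.
    // Tokens are compared against the stop list in their normalized spelling.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stopWords.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stop list containing one word, spelled with Arabic yeh (the normalized form).
        Set<String> stopWords = new HashSet<>(Arrays.asList("\u0641\u064A"));

        // Input: the same word spelled with Farsi yeh, followed by an ordinary word.
        List<String> tokens = Arrays.asList("\u0641\u06CC", "\u0643\u062A\u0627\u0628");

        System.out.println("stop, then normalize: "
                + stopThenNormalize(tokens, stopWords).size() + " tokens kept");
        System.out.println("normalize, then stop: "
                + normalizeThenStop(tokens, stopWords).size() + " tokens kept");
    }
}
```

In this toy setup the first ordering lets the Farsi-yeh spelling slip past the stop list, while the second removes it. The reverse also holds: if a stop list is stored unnormalized, as Robert says the Arabic one is, filtering before normalization matches the raw spellings the list actually contains.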
