Re: Arabic Analyzer: possible bug

DM Smith Thu, 08 Oct 2009 08:29:27 -0700

Robert,
Yes it is tricky.

I'm not suggesting that the ArabicAnalyzer have any stopwords other than Arabic.

I'm suggesting that if I know my input document well, and know that it mixes Arabic with one other known language, I might want to augment the stop list with stop words appropriate for that language. In that case, I think the stop filter should come after the lower case filter.
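
To illustrate the ordering point, here is a minimal sketch in plain Java. It does not use the Lucene API: `StopOrderSketch`, `stop` and `lower` are hypothetical stand-ins for a stop filter and a lower case filter, and the stop set is assumed to hold lower-case English entries only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopOrderSketch {
    // Hypothetical stand-in for a stop filter: drops tokens that match
    // an entry of the stop set exactly (case-sensitive).
    static List<String> stop(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Hypothetical stand-in for a lower case filter.
    static List<String> lower(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stop list: lower-case English entries only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of"));
        // Mixed input; the escaped token is the Arabic word for "book".
        List<String> tokens =
                Arrays.asList("The", "\u0643\u062A\u0627\u0628", "of", "Lucene");

        // Stop first, then lower case: "The" does not match "the" and survives.
        System.out.println(lower(stop(tokens, stopWords)));
        // Lower case first, then stop: "the" is removed as well.
        System.out.println(stop(lower(tokens), stopWords));
    }
}
```

With the stop step first, a capitalized English stop word slips through and is only lower cased afterwards; with the lower case step first, it is removed.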

As to lower casing across the board, I also think it is pretty safe. But I think there are some edge cases. For example, lower casing a Greek word in all upper case that ends in sigma will not produce the same result as lower casing the same word written in all lower case. The lower-case Greek word should end in a final sigma (ς) rather than a regular small sigma (σ). For Greek, using an UpperCaseFilter followed by a LowerCaseFilter would handle this case.
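
A small sketch of the sigma case, in plain Java rather than Lucene. It assumes the filter lower cases one character at a time via `Character.toLowerCase`, with no knowledge of word-final position; `SigmaCaseSketch`, `lowerPerChar` and `upperPerChar` are hypothetical names, and the upper case step stands in for the UpperCaseFilter idea rather than for any existing class.

```java
class SigmaCaseSketch {
    // Per-character lower casing: no context, so capital sigma always
    // becomes the regular small sigma (U+03C3), never the final form.
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Per-character upper casing: both small sigma forms map to U+03A3.
    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Greek word "odos" in capitals, ending in U+03A3.
        String allUpper = "\u039F\u0394\u039F\u03A3";
        // The same word in lower case, ending in final sigma U+03C2.
        String allLower = "\u03BF\u03B4\u03BF\u03C2";

        // Lower casing alone leaves the two spellings different.
        System.out.println(
                lowerPerChar(allUpper).equals(lowerPerChar(allLower))); // false

        // Upper casing first, then lower casing, makes them agree.
        System.out.println(
                lowerPerChar(upperPerChar(allUpper))
                        .equals(lowerPerChar(upperPerChar(allLower)))); // true
    }
}
```

Note that the round trip makes the two spellings match by turning the final sigma into a regular small sigma, which is fine for index terms but not for display.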


IMHO, this is not an issue for the Arabic or Persian analyzers.

-- DM

On 10/08/2009 09:36 AM, Robert Muir wrote:

DM, i suppose. but this is a tricky subject, what if you have mixed Arabic / German or something like that?

for some other languages written in the Latin script, English stopwords could be bad :)

I think that Lowercasing non-Arabic (also cyrillic, etc), is pretty safe across the board though.

On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith...@gmail.com> wrote:


    On 10/08/2009 09:23 AM, Uwe Schindler wrote:

        Just an addition: The lowercase filter is only for the case of embedded
        non-Arabic words. And these will not appear in the stop words.

    I learned something new!

    Hmm. If one has a mixed Arabic / English text, shouldn't one be
    able to augment the stopwords list with English stop words? And if
    so, shouldn't the stop filter come after the lower case filter?

    -- DM


            -----Original Message-----
            From: Basem Narmok [mailto:nar...@gmail.com]
            Sent: Thursday, October 08, 2009 4:20 PM
            To: java-dev@lucene.apache.org
            Subject: Re: Arabic Analyzer: possible bug

            DM, there are no upper/lower cases in Arabic, so don't worry, but the
            stop word list needs some corrections and may be missing some common
            Arabic stop words.

            Best,

            On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith...@gmail.com> wrote:

                Robert,
                Thanks for the info.
                As I said, I am illiterate in Arabic. So I have another, perhaps
                nonsensical, question:
                Does the stop word list have every combination of upper/lower case
                for each Arabic word in the list? (i.e. is it fully de-normalized?)
                Or should it come after LowerCaseFilter?
                -- DM
                On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

                DM, this isn't a bug.

                The arabic stopwords are not normalized.

                but for persian, i normalized the stopwords. mostly because i did
                not want to have to create variations with farsi yah versus arabic
                yah for each one.

                On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:

                    I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I
                    don't know Arabic or Farsi, but have some texts to index in
                    those languages.)
                    The tokenizer/filter chain for ArabicAnalyzer is:
                            TokenStream result = new ArabicLetterTokenizer( reader );
                            result = new StopFilter( result, stoptable );
                            result = new LowerCaseFilter(result);
                            result = new ArabicNormalizationFilter( result );
                            result = new ArabicStemFilter( result );

                            return result;

                    Shouldn't the StopFilter come after ArabicNormalizationFilter?

                    As a comparison the PersianAnalyzer has:
                        TokenStream result = new ArabicLetterTokenizer(reader);
                        result = new LowerCaseFilter(result);
                        result = new ArabicNormalizationFilter(result);
                        /* additional persian-specific normalization */
                        result = new PersianNormalizationFilter(result);
                        /*
                         * the order here is important: the stopword list is
                         * normalized with the above!
                         */
                        result = new StopFilter(result, stoptable);

                        return result;


                    Thanks,
                    DM


                --
                Robert Muir
                rcm...@gmail.com
