Robert,
Yes, it is tricky.

I'm not suggesting that the ArabicAnalyzer have any stopwords other than Arabic.

I'm suggesting that if I know my input document well, and know that it mixes Arabic with one other known language, I might want to augment the stop list with stop words appropriate for that language. In that case, I think the stop filter should come after the lower case filter.
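
A minimal sketch of why the order matters, assuming the augmented stop list holds lower case entries only. It uses plain Java collections rather than Lucene classes; StopOrderSketch, STOP, stop() and lower() are hypothetical names made up for illustration. Arabic tokens have no case, so lowercasing leaves them unchanged and only the embedded Latin-script words are affected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java stand-in for a filter chain; no Lucene classes are used.
class StopOrderSketch {
    // Hypothetical augmented stop list, stored in lower case only.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "and"));

    // Mimics a stop filter: drops tokens that match a stop list entry exactly.
    static List<String> stop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Mimics a lower case filter.
    static List<String> lower(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "AND", "Lucene");
        // Stop first, then lower: "The" and "AND" do not match the list and survive.
        System.out.println(lower(stop(tokens))); // [the, and, lucene]
        // Lower first, then stop: both stop words are removed.
        System.out.println(stop(lower(tokens))); // [lucene]
    }
}
```

With the stop filter first, a capitalized or all-caps stop word slips through unless the list contains every case variant.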

As to lower casing across the board, I also think it is pretty safe, but there are some edge cases. For example, lowercasing a Greek word written in all upper case and ending in sigma will not produce the same result as the same word written in lower case: the lower case word should end in a final sigma (ς), while lowercasing the capital sigma yields a regular small sigma (σ). For Greek, using an UpperCaseFilter followed by a LowerCaseFilter would handle this case.
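
A small sketch of the sigma edge case, assuming the lower case filter maps each char independently with Character.toLowerCase (that is my assumption about the filter, not something I have checked against every version). SigmaSketch, lowerPerChar() and upperPerChar() are hypothetical helpers; the sample word is the Greek "odos", given as Unicode escapes.

```java
// Per-character case mapping, as a char-by-char case filter would apply it.
class SigmaSketch {
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = "\u039F\u0394\u039F\u03A3"; // "ODOS" in all upper case
        String lower = "\u03BF\u03B4\u03BF\u03C2"; // same word, lower case, final sigma U+03C2

        // Lowercasing alone: U+03A3 maps to U+03C3, which differs from U+03C2.
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(lower))); // false

        // Upper then lower: both forms pass through U+03A3 and end in U+03C3.
        System.out.println(lowerPerChar(upperPerChar(upper))
                .equals(lowerPerChar(upperPerChar(lower)))); // true
    }
}
```

Note that upper-then-lower makes the two spellings match each other, but the resulting token ends in a small sigma rather than a final sigma.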

IMHO, this is not an issue for the Arabic or Persian analyzers.

-- DM

On 10/08/2009 09:36 AM, Robert Muir wrote:
DM, i suppose. but this is a tricky subject, what if you have mixed Arabic / German or something like that?

for some other languages written in the Latin script, English stopwords could be bad :)

I think that lowercasing non-Arabic (also Cyrillic, etc.) is pretty safe across the board though.

On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith...@gmail.com> wrote:

    On 10/08/2009 09:23 AM, Uwe Schindler wrote:

        Just an addition: The lowercase filter is only for the case of
        embedded non-Arabic words. And these will not appear in the stop
        words.

    I learned something new!

    Hmm. If one has a mixed Arabic / English text, shouldn't one be
    able to augment the stopwords list with English stop words? And if
    so, shouldn't the stop filter come after the lower case filter?

    -- DM


            -----Original Message-----
            From: Basem Narmok [mailto:nar...@gmail.com]
            Sent: Thursday, October 08, 2009 4:20 PM
            To: java-dev@lucene.apache.org
            Subject: Re: Arabic Analyzer: possible bug

            DM, there are no upper/lower cases in Arabic, so don't worry,
            but the stop word list needs some corrections and may miss
            some common Arabic stop words.

            Best,

            On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith...@gmail.com> wrote:

                Robert,
                Thanks for the info.
                As I said, I am illiterate in Arabic. So I have another,
                perhaps nonsensical, question:
                Does the stop word list have every combination of
                upper/lower case for each Arabic word in the list? (i.e. is
                it fully de-normalized?) Or should it come after
                LowerCaseFilter?
                -- DM
                On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

                DM, this isn't a bug.

                The arabic stopwords are not normalized.

                but for persian, i normalized the stopwords. mostly because
                i did not want to have to create variations with farsi yah
                versus arabic yah for each one.

                On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:

                    I'm wondering if there is a bug in ArabicAnalyzer in
                    2.9. (I don't know Arabic or Farsi, but have some texts
                    to index in those languages.)
                    The tokenizer/filter chain for ArabicAnalyzer is:
                            TokenStream result = new ArabicLetterTokenizer( reader );
                            result = new StopFilter( result, stoptable );
                            result = new LowerCaseFilter(result);
                            result = new ArabicNormalizationFilter( result );
                            result = new ArabicStemFilter( result );

                            return result;

                    Shouldn't the StopFilter come after
                    ArabicNormalizationFilter?

                    As a comparison the PersianAnalyzer has:
                        TokenStream result = new ArabicLetterTokenizer(reader);
                        result = new LowerCaseFilter(result);
                        result = new ArabicNormalizationFilter(result);
                        /* additional persian-specific normalization */
                        result = new PersianNormalizationFilter(result);
                        /*
                         * the order here is important: the stopword list is
                         * normalized with the above!
                         */
                        result = new StopFilter(result, stoptable);

                        return result;


                    Thanks,
                    DM


                --
                Robert Muir
                rcm...@gmail.com

