Re: Arabic Analyzer: possible bug

DM Smith Thu, 08 Oct 2009 09:38:41 -0700

On 10/08/2009 11:46 AM, Robert Muir wrote:

DM by the way, if you want this lowercasing behavior with edge cases, check out LUCENE-1488. There is a case folding filter there, as well as a normalization filter, and they interact correctly for what you want :)

Robert,

So cool. I've been following the emails on this to java-dev that JIRA puts out, but I had not looked at the patch till now. Brought tears to my eyes.


How ready is it? I'd like to use it if it is "good enough".

BTW, does it handle the case where ' (an apostrophe) is used as a character in some languages? (IIRC in some African languages it is a whistle.) That is, do you know whether ICU will consider the context of adjacent characters in determining whether something is a word break?

it's my understanding that contrib/analyzers should not have any external dependencies,

That's my understanding too. But there has got to be a way to provide it w/o duplication of code.

so it could be eons before the jdk exposes these things

I'm using ICU now for that very reason. It takes too long for the JDK to be current on anything, let alone something that Java boasted of in the early days.

, so I don't know what to do. It would be nice if things like ArabicAnalyzer handled Greek edge cases correctly, don't you think?

I do think so. Maybe in the new package (org.apache.lucene.icu) have a subpackage analyzer that's dependent on contrib/analyzers. Or create a PluggableAnalyzer to which one could supply a Tokenizer and an ordered list of Filters, changing the contrib/analyzers to derive from it. Or, use reflection to bring in the ICU ability if the lucene-icu.jar is present. Or, ...

Right now, for each of the contrib/analyzers I have my own copy that mimics them but doesn't use the StandardAnalyzer/StandardFilter (I think I want to use LUCENE-1488), does NFKC normalization, optionally uses a StopFilter (sometimes it is hard to dig out the stop set from the analyzers) and optionally uses a stemmer (snowball if available). Basically, I like all the parts that were provided by a contrib/analyzer, but I have different requirements than how those parts were packaged by the contrib/analyzer's Analyzer. (Thus my question on the order of filters in ArabicAnalyzer.)
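To make the filter-order question concrete, here is a minimal sketch in plain Java. The stop() and lower() methods are hypothetical stand-ins for StopFilter and LowerCaseFilter, not the Lucene classes, and the stop set is made up for illustration. It assumes the stop set holds only lower-case entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java stand-ins for the two filters; these are NOT Lucene classes.
class FilterOrderSketch {
    // Hypothetical stop set with lower-case English entries only.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "and"));

    // Drops every token that appears in the stop set (exact match).
    static List<String> stop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Lower-cases every token; Arabic letters have no case and pass through.
    static List<String> lower(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // "The", an Arabic word (U+0643 U+062A U+0627 U+0628), "and", "Lucene"
        List<String> tokens =
            Arrays.asList("The", "\u0643\u062A\u0627\u0628", "and", "Lucene");

        // Stop first, then lower-case: "The" slips past the stop set.
        System.out.println(lower(stop(tokens)));

        // Lower-case first, then stop: both "The" and "and" are removed.
        System.out.println(stop(lower(tokens)));
    }
}
```

With the stop step first, the capitalized "The" survives and is indexed as "the"; with the lower-case step first, it is removed. That only matters once the stop set contains cased-script words.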

It'd really be nice if there were a way to specify that "tool chain". Ideally, I'd like to get the default chain, and modify it. (And I'd like to store a description of that tool chain with the index, with version info for each of the parts, so that I can tell when an index needs to be rebuilt.)
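Roughly what I have in mind, as a sketch only: none of these types exist in Lucene. Stage, defaultChain(), describe() and needsRebuild() are hypothetical names, and the version strings are made up. The idea is that the chain is plain data that can be listed, reordered, and serialized next to the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of an analysis "tool chain"; not a Lucene API.
class ToolChainSketch {
    // One part of the chain, identified by name and version.
    static final class Stage {
        final String name;
        final String version;

        Stage(String name, String version) {
            this.name = name;
            this.version = version;
        }

        @Override
        public String toString() {
            return name + "@" + version;
        }
    }

    // A made-up default chain; names mirror the filters discussed in this thread.
    static List<Stage> defaultChain() {
        List<Stage> chain = new ArrayList<>();
        chain.add(new Stage("ArabicLetterTokenizer", "2.9"));
        chain.add(new Stage("StopFilter", "2.9"));
        chain.add(new Stage("LowerCaseFilter", "2.9"));
        return chain;
    }

    // Order-sensitive description that could be stored with the index.
    static String describe(List<Stage> chain) {
        StringBuilder sb = new StringBuilder();
        for (Stage s : chain) {
            if (sb.length() > 0) {
                sb.append(" > ");
            }
            sb.append(s);
        }
        return sb.toString();
    }

    // The index is stale if the stored description differs from the current chain.
    static boolean needsRebuild(String storedDescription, List<Stage> current) {
        return !storedDescription.equals(describe(current));
    }

    public static void main(String[] args) {
        List<Stage> chain = defaultChain();
        String stored = describe(chain);   // written at index time
        Collections.swap(chain, 1, 2);     // move StopFilter after LowerCaseFilter
        System.out.println(describe(chain));
        System.out.println(needsRebuild(stored, chain));
    }
}
```

Because the description includes both order and version, reordering two filters or upgrading one part would each flag the index for a rebuild.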


-- DM

On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir <rcm...@gmail.com> wrote:


        I'm suggesting that if I know my input document well and know
        that it has mixed text and that the text is Arabic and one
        other known language that I might want to augment the stop
        list with stop words appropriate for that known language. I
        think that in this case, stop filter should be after lower
        case filter.

     I think this is a good idea?


        As to lower casing across the board, I also think it is pretty
        safe. But I think there are some edge cases. For example,
        lowercasing a Greek word in all upper case ending in sigma
        will not produce the same as lower casing the same Greek word
        in all lower case. The Greek word should have a final sigma
        rather than a small sigma. For Greek, using an UpperCaseFilter
        followed by a LowerCaseFilter would handle this case.

    or you could use unicode case folding. lowercasing is for display
    purposes, not search.
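The sigma edge case above can be sketched in plain Java. This assumes lower-casing is done one character at a time with Character.toLowerCase, which is an assumption about how a LowerCaseFilter-style filter behaves and not something taken from the Lucene source. The class and method names are hypothetical.

```java
// Per-character case mapping, as a stand-in for a simple lower-case filter.
class GreekSigmaSketch {
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // The upper-case-then-lower-case workaround described above.
    static String upperThenLower(String s) {
        return lowerPerChar(upperPerChar(s));
    }

    public static void main(String[] args) {
        String allUpper = "\u039F\u0394\u039F\u03A3"; // capital omicron, delta, omicron, sigma
        String allLower = "\u03BF\u03B4\u03BF\u03C2"; // same word, ending in final sigma U+03C2

        // Capital sigma maps to U+03C3, final sigma stays U+03C2: no match.
        System.out.println(lowerPerChar(allUpper).equals(lowerPerChar(allLower)));     // false

        // Both forms end in U+03C3 after upper-casing first: match.
        System.out.println(upperThenLower(allUpper).equals(upperThenLower(allLower))); // true
    }
}
```

Unicode case folding reaches the same result in one step, since it maps both U+03A3 and U+03C2 to U+03C3.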


        IMHO, this is not an issue for the Arabic or Persian analyzers.

        -- DM


        On 10/08/2009 09:36 AM, Robert Muir wrote:

        DM, i suppose. but this is a tricky subject, what if you have
        mixed Arabic / German or something like that?

        for some other languages written in the Latin script, English
        stopwords could be bad :)

        I think that Lowercasing non-Arabic (also cyrillic, etc), is
        pretty safe across the board though.

        On Thu, Oct 8, 2009 at 9:29 AM, DM Smith
        <dmsmith...@gmail.com> wrote:

            On 10/08/2009 09:23 AM, Uwe Schindler wrote:

                Just an addition: The lowercase filter is only for
                the case of embedded
                non-arabic words. And these will not appear in the
                stop words.

            I learned something new!

            Hmm. If one has a mixed Arabic / English text, shouldn't
            one be able to augment the stopwords list with English
            stop words? And if so, shouldn't the stop filter come
            after the lower case filter?

            -- DM


                    -----Original Message-----
                    From: Basem Narmok [mailto:nar...@gmail.com]
                    Sent: Thursday, October 08, 2009 4:20 PM
                    To: java-dev@lucene.apache.org
                    Subject: Re: Arabic Analyzer: possible bug

                    DM, there is no upper/lower cases in Arabic, so
                    don't worry, but the
                    stop word list needs some corrections and may
                    miss some common/stop
                    Arabic words.

                    Best,

                    On Thu, Oct 8, 2009 at 4:14 PM, DM
                    Smith <dmsmith...@gmail.com> wrote:

                        Robert,
                        Thanks for the info.
                        As I said, I am illiterate in Arabic. So I
                        have another, perhaps
                        nonsensical, question:
                        Does the stop word list have every
                        combination of upper/lower case for each
                        Arabic word in the list? (i.e. is it fully
                        de-normalized?) Or should it come
                        after LowerCaseFilter?
                        -- DM
                        On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:

                        DM, this isn't a bug.

                        The arabic stopwords are not normalized.

                        but for persian, i normalized the stopwords.
                        mostly because i did not want
                        to have to create variations with farsi yah
                        versus arabic yah for each one.

                        On Thu, Oct 8, 2009 at 7:24 AM, DM
                        Smith <dmsmith...@gmail.com> wrote:

                            I'm wondering if there is a bug in
                            ArabicAnalyzer in 2.9. (I don't know
                            Arabic or Farsi, but have some texts to
                            index in those languages.)
                            The tokenizer/filter chain for
                            ArabicAnalyzer is:
                                    TokenStream result = new ArabicLetterTokenizer( reader );
                                    result = new StopFilter( result, stoptable );
                                    result = new LowerCaseFilter(result);
                                    result = new ArabicNormalizationFilter( result );
                                    result = new ArabicStemFilter( result );

                                    return result;

                            Shouldn't the StopFilter come after
                            ArabicNormalizationFilter?

                            As a comparison the PersianAnalyzer has:
                                TokenStream result = new ArabicLetterTokenizer(reader);
                                result = new LowerCaseFilter(result);
                                result = new ArabicNormalizationFilter(result);
                                /* additional persian-specific normalization */
                                result = new PersianNormalizationFilter(result);
                                /*
                                 * the order here is important: the stopword list is normalized with the
                                 * above!
                                 */
                                result = new StopFilter(result, stoptable);

                                return result;


                            Thanks,
                            DM


                        --
                        Robert Muir
                        rcm...@gmail.com

