Re: Arabic Analyzer: possible bug

Robert Muir Thu, 08 Oct 2009 10:52:09 -0700

Basem, I really appreciate your time if you are able to do this.

It's been my hope that introducing Arabic/Farsi support will create enough
interest to encourage more qualified people to come and really make things
nice.


If you don't mind, you can look at
http://wiki.apache.org/lucene-java/HowToContribute and create a JIRA Issue
with a patch file to improve our stopwords list.

Otherwise, in my opinion a good list is also acceptable and I will volunteer
to turn it into a patch :)

On Thu, Oct 8, 2009 at 9:32 AM, Basem Narmok <[email protected]> wrote:

> Robert,
>
> I will be happy to do so. Currently, I am testing the new Arabic
> analyzer in 2.9, and also I will prepare a new stop word list. I will
> provide you with my findings/comments soon.
>
> Best,
>
> On Thu, Oct 8, 2009 at 4:28 PM, Robert Muir <[email protected]> wrote:
> > Basem, by any chance would you be willing to help improve it for us?
> >
> > On Thu, Oct 8, 2009 at 9:20 AM, Basem Narmok <[email protected]> wrote:
> >>
> >> DM, there are no upper/lower cases in Arabic, so don't worry, but the
> >> stop word list needs some corrections and may be missing some common
> >> Arabic stop words.
> >>
> >> Best,
> >>
> >> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <[email protected]> wrote:
> >> > Robert,
> >> > Thanks for the info.
> >> > As I said, I am illiterate in Arabic. So I have another, perhaps
> >> > nonsensical, question:
> >> > Does the stop word list have every combination of upper/lower case for
> >> > each Arabic word in the list? (i.e. is it fully de-normalized?) Or
> >> > should it come after LowerCaseFilter?
> >> > -- DM
> >> > On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
> >> >
> >> > DM, this isn't a bug.
> >> >
> >> > The Arabic stopwords are not normalized.
> >> >
> >> > But for Persian, I normalized the stopwords, mostly because I did not
> >> > want to have to create variations with Farsi yah versus Arabic yah for
> >> > each one.
> >> >
> >> > On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <[email protected]> wrote:
> >> >>
> >> >> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
> >> >> Arabic or Farsi, but have some texts to index in those languages.)
> >> >> The tokenizer/filter chain for ArabicAnalyzer is:
> >> >>         TokenStream result = new ArabicLetterTokenizer( reader );
> >> >>         result = new StopFilter( result, stoptable );
> >> >>         result = new LowerCaseFilter(result);
> >> >>         result = new ArabicNormalizationFilter( result );
> >> >>         result = new ArabicStemFilter( result );
> >> >>
> >> >>         return result;
> >> >>
> >> >> Shouldn't the StopFilter come after ArabicNormalizationFilter?
> >> >>
> >> >> As a comparison the PersianAnalyzer has:
> >> >>     TokenStream result = new ArabicLetterTokenizer(reader);
> >> >>     result = new LowerCaseFilter(result);
> >> >>     result = new ArabicNormalizationFilter(result);
> >> >>     /* additional persian-specific normalization */
> >> >>     result = new PersianNormalizationFilter(result);
> >> >>     /*
> >> >>      * the order here is important: the stopword list is normalized
> >> >>      * with the above!
> >> >>      */
> >> >>     result = new StopFilter(result, stoptable);
> >> >>
> >> >>     return result;
> >> >>
> >> >>
> >> >> Thanks,
> >> >> DM
> >> >
> >> >
> >> > --
> >> > Robert Muir
> >> > [email protected]
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >
> >
> >
> > --
> > Robert Muir
> > [email protected]
> >
>
>
>


-- 
Robert Muir
[email protected]
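
The ordering point discussed in this thread (StopFilter before normalization
when the stopword list is not normalized, after it when the list is) can be
illustrated with a minimal sketch. It does not use Lucene at all: the class
StopwordOrderSketch, its methods, and the toy normalize() rule (fold three
alef variants to bare alef, drop tatweel) are assumptions made up for
illustration, not the actual ArabicNormalizationFilter or StopFilter code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordOrderSketch {

    /** Toy normalizer (an assumption, not Lucene's): folds alef variants, drops tatweel. */
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            switch (c) {
                case '\u0622': // alef with madda above
                case '\u0623': // alef with hamza above
                case '\u0625': // alef with hamza below
                    sb.append('\u0627'); // bare alef
                    break;
                case '\u0640': // tatweel (elongation mark) is dropped
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /** ArabicAnalyzer-style order: filter stopwords on raw tokens, then normalize. */
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                out.add(normalize(t));
            }
        }
        return out;
    }

    /** PersianAnalyzer-style order: normalize first, then filter stopwords. */
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String n = normalize(t);
            if (!stopwords.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One stopword written with hamza-above alef, plus one ordinary word.
        List<String> tokens = Arrays.asList("\u0623\u0646", "\u0643\u062A\u0627\u0628");
        Set<String> rawStops = new HashSet<>(Arrays.asList("\u0623\u0646"));        // not normalized
        Set<String> normalizedStops = new HashSet<>(Arrays.asList("\u0627\u0646")); // normalized

        System.out.println("raw list, stop first:        " + stopThenNormalize(tokens, rawStops).size());
        System.out.println("raw list, normalize first:   " + normalizeThenStop(tokens, rawStops).size());
        System.out.println("norm. list, normalize first: " + normalizeThenStop(tokens, normalizedStops).size());
    }
}
```

In this sketch each stopword list only removes its stopword when the filter
sees tokens in the same form as the list entries, which is the reason the two
analyzers can legitimately use different filter orders.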
