Robert,
Yes it is tricky.
I'm not suggesting that the ArabicAnalyzer have any stopwords other than
Arabic.
I'm suggesting that if I know my input well, and know that it mixes
Arabic with one other known language, I might want to augment the stop
list with stop words appropriate for that language. In that case, I
think the stop filter should come after the lower case filter.
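To make the ordering point concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: stop() and lower() are hypothetical stand-ins for a stop filter and a lower case filter, and the three-word English stop list is made up for illustration. It assumes the stop list holds only lower-case entries and is matched exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopOrderSketch {
    // Hypothetical augmented stop list: lower-case English entries only.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "and", "of"));

    // Stand-in for a stop filter: drops tokens found verbatim in the stop set.
    static List<String> stop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Stand-in for a lower case filter.
    static List<String> lower(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // "The", an Arabic word (kaf teh alef beh), "and", "Lucene"
        List<String> tokens = Arrays.asList("The", "\u0643\u062A\u0627\u0628", "and", "Lucene");

        // Stop first, then lowercase: "The" does not match "the" and survives.
        System.out.println(lower(stop(tokens)).size()); // prints 3

        // Lowercase first, then stop: both "the" and "and" are removed.
        System.out.println(stop(lower(tokens)).size()); // prints 2
    }
}
```

With the stop step first, a capitalized English stop word slips through and is only lowercased afterwards. The Arabic token is unaffected by either order.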
As to lowercasing across the board, I also think it is pretty safe. But
I think there are some edge cases. For example, lowercasing an
all-upper-case Greek word that ends in sigma will not produce the same
result as lowercasing the same word already written in lower case. The
properly written lower-case word ends in a final sigma (ς), whereas
lowercasing the capital sigma (Σ) one character at a time yields the
ordinary small sigma (σ). For Greek, using an UpperCaseFilter followed
by a LowerCaseFilter would handle this case, since both spellings would
then pass through the same capital sigma.
IMHO, this is not an issue for the Arabic or Persian analyzers.
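The sigma edge case can be reproduced with the JDK alone. This is only a sketch: lowerPerChar() and upperPerChar() are hypothetical helpers that assume a filter which maps one character at a time via Character.toLowerCase / Character.toUpperCase, with no look at neighbouring characters. The sample word is written with Unicode escapes to keep the source ASCII-only.

```java
final class SigmaSketch {
    // Lowercases one char at a time, ignoring context.
    static String lowerPerChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // Uppercases one char at a time, ignoring context.
    static String upperPerChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toUpperCase(buf[i]);
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        // "LOGOS" in Greek capitals, ending in capital sigma U+03A3.
        String upper = "\u039B\u039F\u0393\u039F\u03A3";
        // The same word in lower case, ending in final sigma U+03C2.
        String lower = "\u03BB\u03BF\u03B3\u03BF\u03C2";

        // Lowercasing alone: U+03A3 becomes small sigma U+03C3, while
        // U+03C2 stays as it is, so the two tokens differ.
        System.out.println(lowerPerChar(upper).equals(lowerPerChar(lower))); // prints false

        // Uppercasing first maps U+03C2 to U+03A3, so both end in U+03C3.
        System.out.println(
            lowerPerChar(upperPerChar(upper)).equals(lowerPerChar(upperPerChar(lower)))); // prints true
    }
}
```

Note that after the upper-then-lower pass both tokens end in the ordinary small sigma, so any Greek stop words would need the same treatment to match.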
-- DM
On 10/08/2009 09:36 AM, Robert Muir wrote:
DM, i suppose. but this is a tricky subject, what if you have mixed
Arabic / German or something like that?
for some other languages written in the Latin script, English
stopwords could be bad :)
I think that lowercasing non-Arabic text (also Cyrillic, etc.) is pretty
safe across the board though.
On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith...@gmail.com> wrote:
On 10/08/2009 09:23 AM, Uwe Schindler wrote:
Just an addition: The lowercase filter is only for the case of embedded
non-Arabic words. And these will not appear in the stop words.
I learned something new!
Hmm. If one has a mixed Arabic / English text, shouldn't one be
able to augment the stopwords list with English stop words? And if
so, shouldn't the stop filter come after the lower case filter?
-- DM
-----Original Message-----
From: Basem Narmok [mailto:nar...@gmail.com]
Sent: Thursday, October 08, 2009 4:20 PM
To: java-dev@lucene.apache.org
Subject: Re: Arabic Analyzer: possible bug
DM, there are no upper/lower cases in Arabic, so don't worry, but the
stop word list needs some corrections and may miss some common/stop
Arabic words.
Best,
On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith...@gmail.com> wrote:
Robert,
Thanks for the info.
As I said, I am illiterate in Arabic. So I have another, perhaps
nonsensical, question:
Does the stop word list have every combination of upper/lower case for
each Arabic word in the list? (i.e. is it fully de-normalized?) Or
should it come after LowerCaseFilter?
-- DM
On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
DM, this isn't a bug.
The arabic stopwords are not normalized.
but for persian, i normalized the stopwords. mostly because i did not
want to have to create variations with farsi yah versus arabic yah for
each one.
On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:
I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't know
Arabic or Farsi, but have some texts to index in those languages.)
The tokenizer/filter chain for ArabicAnalyzer is:
    TokenStream result = new ArabicLetterTokenizer( reader );
    result = new StopFilter( result, stoptable );
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter( result );
    result = new ArabicStemFilter( result );
    return result;
Shouldn't the StopFilter come after ArabicNormalizationFilter?
As a comparison the PersianAnalyzer has:
    TokenStream result = new ArabicLetterTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new ArabicNormalizationFilter(result);
    /* additional persian-specific normalization */
    result = new PersianNormalizationFilter(result);
    /*
     * the order here is important: the stopword list is normalized with the
     * above!
     */
    result = new StopFilter(result, stoptable);
    return result;
Thanks,
DM
--
Robert Muir
rcm...@gmail.com