Question concerning Arabic analyzer

Michal Diamantstein Mon, 30 Mar 2015 09:28:27 -0700

Hi,
I'm a software developer at Genesys and we use Lucene in our product.
We recently added Arabic support, which includes indexing (writing and reading)
data in this language.
Using ArabicLetterTokenizer from
http://lucenenet.apache.org/docs/3.0.3/dc/d1c/_arabic_letter_tokenizer_8cs_source.html
I ran into an issue:
the function IsTokenChar(char c) does not accept digits, so numbers are lost
during tokenization.


        /**
         * Allows for Letter category or NonspacingMark category
         * @see org.apache.lucene.analysis.LetterTokenizer#isTokenChar(char)
         */
        protected internal override bool IsTokenChar(char c)
        {
            return base.IsTokenChar(c)
                || char.GetUnicodeCategory(c) == System.Globalization.UnicodeCategory.NonSpacingMark;
        }


What is the reason for not allowing numbers?

Our process uses the analyzer to get all the tokens and then builds
a TermQuery, a PhraseQuery, or nothing, depending on the term count.
While iterating over the tokens, numbers are dropped.
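
To make the symptom concrete, here is a minimal standalone sketch that
reproduces it without referencing Lucene.NET. It assumes the base
LetterTokenizer check is equivalent to char.IsLetter; the class
TokenCharSketch and the digit-accepting predicate are hypothetical names
used only for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Standalone illustration only; this is not the Lucene.NET source.
public static class TokenCharSketch
{
    // Mirrors the quoted predicate, assuming the base check is char.IsLetter:
    // accepts letters and nonspacing marks, rejects digits.
    public static bool IsLetterOrMark(char c)
    {
        return char.IsLetter(c)
            || char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;
    }

    // Hypothetical variant that also accepts decimal digits
    // (ASCII 0-9 as well as Arabic-Indic digits, both DecimalDigitNumber).
    public static bool IsLetterMarkOrDigit(char c)
    {
        return IsLetterOrMark(c) || char.IsDigit(c);
    }

    // Splits text into maximal runs of characters accepted by the predicate,
    // which is roughly how a character-class tokenizer forms tokens.
    public static List<string> Split(string text, Func<char, bool> isTokenChar)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (isTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

With the input "room 101", Split with IsLetterOrMark yields the single
token "room", which would lead to a TermQuery, while Split with
IsLetterMarkOrDigit yields "room" and "101", which would lead to a
PhraseQuery. The same difference applies to Arabic text followed by digits.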

Thanks in advance.


Michal Diamantstein
Software Engineer
T:  +972 72 220 1866
M: +972 50 424 5533
[email protected]




