Hi,
I'm a software developer at Genesys and we use Lucene in our product.
We recently added support for Arabic, which includes indexing (writing and
reading) data in this language.
Using ArabicLetterTokenizer from
http://lucenenet.apache.org/docs/3.0.3/dc/d1c/_arabic_letter_tokenizer_8cs_source.html
I ran into an issue:
The function IsTokenChar(char c) does not accept digits as token characters,
so numbers are dropped during tokenization.
    /**
     * Allows for Letter category or NonspacingMark category
     * @see org.apache.lucene.analysis.LetterTokenizer#isTokenChar(char)
     */
    protected internal override bool IsTokenChar(char c)
    {
        return base.IsTokenChar(c) ||
            char.GetUnicodeCategory(c) == System.Globalization.UnicodeCategory.NonSpacingMark;
    }
What is the reason for not allowing numbers?
Our process uses the analyzer to get all the tokens and then builds a
TermQuery, a PhraseQuery, or nothing, depending on the token count.
While going over the tokens, we see that the numbers have been dropped.
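
To make the behaviour concrete, here is a minimal standalone sketch of the character check. It is not Lucene.Net code: TokenCharSketch, IsLetterOrMark, IsLetterMarkOrDigit and Split are hypothetical names used only for illustration, and it assumes the base LetterTokenizer check is equivalent to char.IsLetter.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Standalone illustration only; this does not use or modify Lucene.Net.
// All names in this class are hypothetical.
public static class TokenCharSketch
{
    // Mirrors the check quoted above: letters or nonspacing marks.
    // Assumption: the base LetterTokenizer check behaves like char.IsLetter.
    public static bool IsLetterOrMark(char c)
    {
        return char.IsLetter(c)
            || char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;
    }

    // Hypothetical variant that also accepts decimal digits,
    // including Arabic-Indic digits such as U+0661.
    public static bool IsLetterMarkOrDigit(char c)
    {
        return IsLetterOrMark(c) || char.IsDigit(c);
    }

    // Splits text into maximal runs of characters accepted by the predicate,
    // which is roughly what a character-based tokenizer does.
    public static List<string> Split(string text, Func<char, bool> isTokenChar)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (isTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

With this sketch, TokenCharSketch.Split("room 123", TokenCharSketch.IsLetterOrMark) yields only "room", while passing TokenCharSketch.IsLetterMarkOrDigit yields "room" and "123". This matches what we observe with the real tokenizer.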
Thanks in advance.
Michal Diamantstein
Software Engineer
T: +972 72 220 1866
M: +972 50 424 5533