Hi guys,

Nick from Atlassian here. We had a customer complain that they could not search on numbers when using Russian as their indexing language.

I tracked this down to the RussianLetterTokenizer.
This extends the CharTokenizer and basically splits on any character that is neither a letter (according to Character.isLetter()) nor included in a char array that is passed in the constructor. In effect, digits are treated as separators, so numbers never make it into a token.

We were passing in the RussianCharsets.UnicodeRussian charset to the constructor. I can get around this issue by adding the chars 0-9 to the passed in char set.
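To illustrate what I mean, here is a rough standalone sketch of the behaviour as I understand it. This is not the Lucene source: the class and method names (TokenCharSketch, isTokenChar taking a char array, tokenize, withDigits) are made up for the example, and an empty array stands in for the Russian char set. It only shows why digits get dropped and what appending 0-9 changes.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch, NOT Lucene's code. All names here are hypothetical.
class TokenCharSketch {

    // A character is part of a token if it is a letter
    // or appears in the extra char array.
    static boolean isTokenChar(char c, char[] extraChars) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char extra : extraChars) {
            if (c == extra) {
                return true;
            }
        }
        return false;
    }

    // Split the text into maximal runs of token characters.
    static List<String> tokenize(String text, char[] extraChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, extraChars)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // The workaround: copy the char set and append '0'..'9'.
    static char[] withDigits(char[] base) {
        char[] extended = new char[base.length + 10];
        System.arraycopy(base, 0, extended, 0, base.length);
        for (int d = 0; d < 10; d++) {
            extended[base.length + d] = (char) ('0' + d);
        }
        return extended;
    }

    public static void main(String[] args) {
        char[] base = new char[0]; // stand-in for the Russian char set (assumption)
        String text = "invoice 12345 paid";
        System.out.println(tokenize(text, base));             // [invoice, paid]
        System.out.println(tokenize(text, withDigits(base))); // [invoice, 12345, paid]
    }
}
```

With the unmodified char set the number disappears from the token stream, which matches what the customer saw. With the digits appended it comes through as its own token.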

From what I can tell, there shouldn't be any side effects to this, though I don't think it is the correct solution.

What I am wondering is whether there is any reason why they didn't use the StandardTokenizer with an extended char set, and whether this is something we should look at fixing. Not speaking Russian, I can't tell if that would be the correct way to do it, but Russian indexing would then benefit from the greater functionality provided by the StandardTokenizer.

I have also noticed that some other languages go down this path, e.g. Greek.

Cheers,
Nick
