Hi guys,
Nick from Atlassian here. We had a customer complain that they could
not search on numbers when using Russian as their indexing language.
I tracked this down to the RussianLetterTokenizer.
This extends CharTokenizer and treats a character as part of a token only
if Character.isLetter() returns true for it or it appears in a char array
passed to the constructor; everything else is a token break, so it
effectively ignores numbers.
We were passing in the RussianCharsets.UnicodeRussian charset to the
constructor.
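For reference, this is roughly the logic as I read it - just a sketch of
the behaviour described above, not the actual RussianLetterTokenizer
source, and the class name is made up:

    import java.io.Reader;
    import org.apache.lucene.analysis.CharTokenizer;

    // Sketch: a character is kept only if Character.isLetter() says so or
    // it appears in the char[] given to the constructor, so digits end the
    // current token and are dropped.
    public class LetterOrCharsetTokenizer extends CharTokenizer {
        private final char[] charset;

        public LetterOrCharsetTokenizer(Reader in, char[] charset) {
            super(in);
            this.charset = charset;
        }

        protected boolean isTokenChar(char c) {
            if (Character.isLetter(c)) {
                return true;
            }
            for (int i = 0; i < charset.length; i++) {
                if (c == charset[i]) {
                    return true;
                }
            }
            return false;
        }
    }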
I can get around the issue by adding the chars 0-9 to the passed-in char
set. From what I can tell, there shouldn't be any side effects to this,
though I don't think it is the correct solution.
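Roughly what the workaround looks like (the class and helper names below
are mine, not anything in Lucene):

    import org.apache.lucene.analysis.ru.RussianCharsets;

    // Build a copy of the Unicode Russian charset with '0'-'9' appended and
    // pass that to the RussianLetterTokenizer constructor instead of
    // RussianCharsets.UnicodeRussian, so digits count as token chars.
    public class DigitsWorkaround {
        public static char[] withDigits(char[] charset) {
            char[] extended = new char[charset.length + 10];
            System.arraycopy(charset, 0, extended, 0, charset.length);
            for (int i = 0; i < 10; i++) {
                extended[charset.length + i] = (char) ('0' + i);
            }
            return extended;
        }

        public static void main(String[] args) {
            char[] charset = withDigits(RussianCharsets.UnicodeRussian);
            // new RussianLetterTokenizer(reader, charset) then keeps numbers.
        }
    }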
What I am wondering is whether there is any reason the StandardTokenizer,
with an extended char set, wasn't used instead? And is this something we
should look at fixing? Not speaking Russian, I can't tell whether that is
the correct way to do it, but Russian users would then benefit from the
greater functionality provided by the StandardTokenizer.
I have also noticed that some other languages go down this path, e.g. Greek.
Cheers,
Nick