Russian Tokenizing

Nick Menere Mon, 18 Jun 2007 22:04:27 -0700

Hi guys,

Nick from Atlassian here. We had a customer complain that they could not search on numbers when using Russian as their indexing language.


I tracked this down to the RussianLetterTokenizer.

This extends CharTokenizer and basically splits on any character that is neither a letter (per Character.isLetter()) nor included in the char array that is passed to the constructor. As a result, it effectively drops numbers.

We were passing the RussianCharsets.UnicodeRussian charset to the constructor. I can get around this issue by adding the chars 0-9 to the passed-in char set.
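
To make the behaviour concrete, here is a minimal, self-contained sketch of the token-character test as described above. It is not the Lucene source: the class TokenCharSketch and its methods isTokenChar and tokenize are hypothetical stand-ins, and the only assumption is that a character belongs to a token when it is a letter or appears in the supplied char array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the behaviour described above; NOT Lucene code.
public class TokenCharSketch {

    // A char is part of a token if it is a letter or is listed in extraChars.
    static boolean isTokenChar(char c, char[] extraChars) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char extra : extraChars) {
            if (c == extra) {
                return true;
            }
        }
        return false;
    }

    // Split text into runs of token chars; everything else is a separator.
    static List<String> tokenize(String text, char[] extraChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, extraChars)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes spell the Russian word for "version", followed by a year.
        String text = "\u0432\u0435\u0440\u0441\u0438\u044f 2007";

        char[] lettersOnly = {};
        char[] withDigits = "0123456789".toCharArray();

        // Without digits in the char array, only the word survives.
        System.out.println(tokenize(text, lettersOnly).size()); // 1

        // With 0-9 added (the workaround), the number becomes a token too.
        System.out.println(tokenize(text, withDigits).size());  // 2
    }
}
```

With an empty extra char array the digits act as separators and are thrown away, which matches the customer's symptom; adding 0-9 keeps them as a token.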

From what I can tell, there shouldn't be any side effects to this, though I don't think this is the correct solution.

What I am wondering is: is there any reason why they didn't use the StandardTokenizer with an extended char set? And is this something we should look at fixing? Not speaking Russian, I can't tell if this is the correct way to do it. They would then benefit from the greater functionality provided by the StandardTokenizer.


I have also noticed that some other languages go down this path, e.g. Greek.

Cheers,
Nick
