Hi all, I am trying to tune Lucene to respect tokens such as C++, C#, .NET.
The problem is well known in the Lucene community, but surprisingly I can't find any good information on it by googling.

Of course, I tried to reuse Lucene's building blocks for a Tokenizer. Here we go:

1) StandardTokenizer: this option would be just fantastic, but "C++, C#, .NET" ends up as "c c net". Too bad.

2) WhitespaceTokenizer gives me a lot of lexemes that should actually have been chopped into smaller pieces. Example: "C/C++" comes out as a single lexeme. If I go this way I end up with "tokenization of tokens", which sounds a bit odd, doesn't it?

3) CharTokenizer allows me to make '/' a token-emitting char as well, but then '/' is immediately lost, just like the whitespace chars. As a result "SAP R/3" ends up as "SAP" "R" "3", and one would need to search the original char stream for the "/" char to rebuild the term "SAP R/3" as a whole.

Do you see any other relevant building blocks that I have missed?

Also, people have suggested that this problem should be solved with a synonym dictionary. However, this hint sheds no light on which tokenization strategy is appropriate *before* the synonym step.

So it looks like I have to take the class CharTokenizer as the starting point and write my own Tokenizer. This Tokenizer should also react to delimiting characters and emit a token. However, it should distinguish between delimiters like whitespace along with ";,?" and delimiters like "./&". That is, whitespace and ";,?" should be discarded at the lexeme level, whereas token-emitting characters like "./&" should be kept at the lexeme level.

Your comments, gurus?

regards,
Valery

--
View this message in context: http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
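P.S. To make the two kinds of delimiters concrete, here is a minimal sketch in plain Java. It is not Lucene code and does not extend CharTokenizer; the class name DelimiterAwareSplitter and the method tokenize are hypothetical names chosen only for illustration. The two delimiter sets are taken from the examples above, and I am assuming that every other character (including '+' and '#') simply belongs to a token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates the splitting rule only.
final class DelimiterAwareSplitter {
    // Assumption: illustrative sets taken from the examples in the post.
    private static final String DROPPED = ";,?"; // discarded, just like whitespace
    private static final String KEPT = "./&";    // end a token AND are emitted themselves

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean dropped = Character.isWhitespace(c) || DROPPED.indexOf(c) >= 0;
            boolean kept = KEPT.indexOf(c) >= 0;
            if (dropped || kept) {
                // Either kind of delimiter closes the token collected so far.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                // Only the "kept" delimiters survive as one-character tokens.
                if (kept) {
                    tokens.add(String.valueOf(c));
                }
            } else {
                // Any other character (letters, digits, '+', '#', ...) extends the token.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SAP R/3"));         // prints [SAP, R, /, 3]
        System.out.println(tokenize("C/C++, C#, .NET")); // prints [C, /, C++, C#, ., NET]
    }
}
```

With the "/" and "." still present in the token stream, a later synonym or phrase step could rebuild "SAP R/3" or ".NET" without going back to the original char stream. Lowercasing and token offsets are left out of this sketch on purpose.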