Valery,

One thing you could try is to write a JFlex-based tokenizer whose grammar encodes exactly the rules you want. The source code and grammar of StandardTokenizer make a good starting point.
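
To make the kind of rules such a grammar would hold a bit more concrete, here is a rough sketch in plain Java. It is not JFlex and it does not use any Lucene API; it only imitates an ordered list of lexer rules with one regular expression. The class name, the method name and the exact patterns are my own assumptions for illustration, so treat it as a starting point for writing the real grammar rather than a finished tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: plain java.util.regex, no Lucene or JFlex classes.
public class TechTermTokenizerSketch {

    // Alternatives are tried in order, much like rules in a lexer grammar:
    //   1. letters followed by "++" or "#"        (C++, C#)
    //   2. a dot followed by letters, but only when the dot does not
    //      directly follow a letter or digit      (.NET, not the ".bar" in foo.bar)
    //   3. a plain alphanumeric word
    //   4. a single kept delimiter: '.', '/' or '&'
    // Anything that matches no alternative (whitespace, ';', ',', '?') is skipped.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z]+(?:\\+\\+|#)"
        + "|(?<![A-Za-z0-9])\\.[A-Za-z]+"
        + "|[A-Za-z0-9]+"
        + "|[./&]");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {          // find() jumps over characters no rule accepts
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("C++, C#, .NET")); // [C++, C#, .NET]
        System.out.println(tokenize("C/C++"));         // [C, /, C++]
        System.out.println(tokenize("SAP R/3"));       // [SAP, R, /, 3]
    }
}
```

Because '/' survives as its own token here, a later filter (or a synonym step) could still glue "R", "/", "3" back into "R/3" without going back to the original character stream. In a real JFlex grammar you would express the same alternatives as named macros and rules, in the same order of precedence.
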

On Thu, Aug 20, 2009 at 10:28 AM, Valery<khame...@gmail.com> wrote:
>
> Hi all,
>
> I am trying to tune Lucene to respect tokens like C++, C#, .NET
>
> The task is well known to the Lucene community, but surprisingly I can't
> find any good information on it.
>
> Of course, I tried to re-use Lucene's building blocks for a Tokenizer.
> Here we go:
>
> 1) StandardTokenizer -- oh, this option would be just fantastic, but
> "C++, C#, .NET" ends up as "c c net". Too bad.
>
> 2) WhitespaceTokenizer gives me a lot of lexemes that should actually
> have been chopped into smaller pieces. Example: "C/C++" comes out as a
> single lexeme. If I go this way I end up with "tokenization of tokens" --
> that sounds a bit odd, doesn't it?
>
> 3) CharTokenizer allows me to make '/' a token-emitting char as well,
> but then '/' is immediately lost, just like the whitespace chars. As a
> result "SAP R/3" ends up as "SAP" "R" "3", and one would need to search
> the original char stream for the "/" char to rebuild the term "SAP R/3"
> as a whole.
>
> Do you see any other relevant building blocks that I have missed?
>
> Also, people here have suggested that this problem should be solved by a
> synonym dictionary. However, this hint sheds no light on which
> tokenization strategy is appropriate *before* the synonym step.
>
> So, it looks like I have to take the class CharTokenizer as a starting
> point and write my own Tokenizer. This Tokenizer should also react to
> delimiting characters and emit the token. However, it should distinguish
> between delimiters like whitespace along with ";,?" and delimiters like
> "./&".
>
> Indeed, delimiters like whitespace and ";,?" should be thrown away at
> the lexeme level, whereas token-emitting characters like "./&" should be
> kept at the lexeme level.
>
> Your comments, gurus?
>
> regards,
> Valery
>
> --
> View this message in context:
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--
Robert Muir
rcm...@gmail.com