Valery, oh I think there might be other ways to solve this. But you provided some examples such as C/C++ and SAP R/3. In these two examples you want the "/" to behave differently depending upon context, so my first thought was that a grammar might be a good way to ensure it does what you want.
On Thu, Aug 20, 2009 at 11:09 AM, Valery<khame...@gmail.com> wrote: > > Hi Robert, > > thanks for the hint. > > Indeed, a natural way to go. Especially if one builds a Tokenizer of the > level of quality like StandardTokenizer's. > > OTOH, you mean that the out-of-the-box stuff is indeed not customizable for > this task?.. > > regards > Valery > > > > Robert Muir wrote: >> >> Valery, >> >> One thing you could try would be to create a JFlex-based tokenizer, >> specifying a grammar with the rules you want. >> You could use the source code & grammar of StandardTokenizer as a >> starting point. >> >> >> On Thu, Aug 20, 2009 at 10:28 AM, Valery<khame...@gmail.com> wrote: >>> >>> Hi all, >>> >>> I am trying to tune Lucene to respect such tokens like C++, C#, .NET >>> >>> The task is known for Lucene community, but surprisingly I can't google >>> out >>> somewhat good info on it. >>> >>> Of course, I tried to re-use Lucene's building blocks for Tokenizer. >>> Here >>> we go: >>> >>> 1) StandardTokenizer -- oh, this option would be just fantastic, but >>> "C++, >>> C#, .NET" ends up with "c c net". Too bad. >>> >>> 2) WhitespaceTokenizer gives me a lot of lexems that are actually should >>> have been chopped into smaller pieces. Example: "C/C++" comes out like a >>> single lexem. If I follow this way I end-up with "Tokenization of tokens" >>> -- >>> that sounds a bit odd, doesn't it? >>> >>> 3) CharTokenizer allows me to add the '/' to be also a token-emitting >>> char, but then '/' gets immediately lost like those whitespace chars. In >>> result "SAP R/3" ends up with "SAP" "R" "3" and one will need to search >>> the >>> original char stream for the "/" char to re-build "SAP R/3" term as a >>> whole. >>> >>> Do you see any other relevant building blocks missed by me? >>> >>> Also, people around there have meant that such problem should be solved >>> by a >>> synonym dictionary. However this hint sheds no light on which >>> tokenization >>> strategy should be more appropriate *before* the synonym step. >>> >>> So, it looks like I have to take the class CharTokenizer as for the >>> starting >>> point and write anew my own Tokenizer. This Tokenizer should also react >>> on >>> delimiting characters and emit the token. However, it should distinguish >>> between delimiters like whitespaces along with ";,?" and the delimiters >>> like >>> "./&". >>> >>> Indeed, the delimiters like whitespaces and ";,?" should be thrown away >>> from >>> Lexem level, >>> whereas the token emitting characters like "./&" should be kept in Lexem >>> level. >>> >>> Your comments, gurus? >>> >>> regards, >>> Valery >>> >>> -- >>> View this message in context: >>> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html >>> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> >> >> >> -- >> Robert Muir >> rcm...@gmail.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> > > -- > View this message in context: > http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org