Valery,

One thing you could try would be to create a JFlex-based tokenizer,
specifying a grammar with the rules you want.
You could use the source code & grammar of StandardTokenizer as a
starting point.
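
To give a rough idea of what such rules might look like, here is a small
sketch. It is plain Java using java.util.regex, not JFlex and not the Lucene
Tokenizer API, so treat it only as a stand-in for the grammar rules. The class
name TechTermTokenizerSketch and the rule set itself are made up for
illustration. The rules are tried in order, most specific first: terms like
"C++" and "C#", then ".NET"-style terms, then ordinary words, then the
delimiters "./&" that should survive as tokens. Whitespace and ";,?" match no
rule and are simply dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: plain regex rules, not JFlex or Lucene code.
final class TechTermTokenizerSketch {

    // Alternatives are tried left to right, so specific rules come first.
    private static final Pattern TOKEN = Pattern.compile(
          "[A-Za-z]\\+\\+"                              // e.g. C++
        + "|[A-Za-z]#"                                  // e.g. C#
        + "|(?<![A-Za-z0-9])\\.[A-Za-z][A-Za-z0-9]*"    // e.g. .NET (dot not preceded by a word char)
        + "|[A-Za-z0-9]+"                               // ordinary words and numbers
        + "|[./&]"                                      // delimiters that are kept as tokens
    );

    // Returns the tokens in order; characters matching no rule are discarded.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [C++, C#, .NET, C, /, C++, and, SAP, R, /, 3]
        System.out.println(tokenize("C++, C#, .NET; C/C++ and SAP R/3"));
    }
}
```

Because "/" is kept as its own token here, a later filter or synonym step could
still see "R", "/", "3" in sequence and rebuild "R/3" if that is what you want.
In a real JFlex grammar the same ordering idea applies, but the rules would be
written in JFlex syntax and wired into a Tokenizer subclass.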


On Thu, Aug 20, 2009 at 10:28 AM, Valery <khame...@gmail.com> wrote:
>
> Hi all,
>
> I am trying to tune Lucene to respect tokens such as C++, C#, and .NET.
>
> The task is well known in the Lucene community, but surprisingly I can't
> find any good information on it.
>
> Of course, I tried to re-use Lucene's Tokenizer building blocks. Here
> we go:
>
>  1) StandardTokenizer -- oh, this option would be just fantastic, but "C++,
> C#, .NET" ends up as "c c net". Too bad.
>
>  2) WhitespaceTokenizer gives me a lot of lexemes that should actually
> have been chopped into smaller pieces. Example: "C/C++" comes out as a
> single lexeme. If I go this way I end up with "tokenization of tokens" --
> that sounds a bit odd, doesn't it?
>
>  3) CharTokenizer lets me treat '/' as a token-delimiting char as well,
> but then '/' is discarded just like the whitespace chars. As a
> result "SAP R/3" ends up as "SAP" "R" "3", and one would need to search the
> original char stream for the "/" char to rebuild the term "SAP R/3" as a whole.
>
> Do you see any other relevant building blocks that I have missed?
>
> Also, people here and there have suggested that this problem should be
> solved with a synonym dictionary. However, this hint sheds no light on which
> tokenization strategy is appropriate *before* the synonym step.
>
> So, it looks like I have to take the class CharTokenizer as the starting
> point and write my own Tokenizer. This Tokenizer should also react to
> delimiting characters and emit a token. However, it should distinguish
> between delimiters like whitespace and ";,?" on the one hand and delimiters
> like "./&" on the other.
>
> That is, delimiters like whitespace and ";,?" should be thrown away at the
> lexeme level,
> whereas token-emitting characters like "./&" should be kept at the lexeme
> level.
>
> Your comments, gurus?
>
> regards,
> Valery
>
> --
> View this message in context: 
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 
Robert Muir
rcm...@gmail.com
