Hi Valery,

From our experience at Krugle, we wound up having to create our own tokenizers (actually a kind of specialized parser) for the different languages. It didn't seem like a good option to try to twist one of the existing tokenizers into something that would work well enough. We wound up using ANTLR for this.
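In rough outline -- with made-up names, this isn't our actual code -- each ANTLR-generated lexer got wrapped in a Lucene Tokenizer along these lines:

    import java.io.IOException;
    import java.io.Reader;
    import org.antlr.runtime.ANTLRReaderStream;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    public class AntlrCodeTokenizer extends Tokenizer {
      // CppLexer is a stand-in for whatever lexer ANTLR
      // generates from your per-language grammar
      private final CppLexer lexer;

      public AntlrCodeTokenizer(Reader in) throws IOException {
        super(in);
        lexer = new CppLexer(new ANTLRReaderStream(in));
      }

      public Token next(Token reusable) throws IOException {
        org.antlr.runtime.Token t = lexer.nextToken();
        if (t == null || t.getType() == org.antlr.runtime.Token.EOF)
          return null;                       // end of input
        reusable.clear();
        char[] term = t.getText().toCharArray();
        reusable.setTermBuffer(term, 0, term.length);
        return reusable;
      }
    }

The nice part is that the grammar, not the Tokenizer, decides that "C++" or "R/3" is one token.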

-- Ken


On Aug 20, 2009, at 8:09am, Valery wrote:


Hi Robert,

thanks for the hint.

Indeed, that's a natural way to go -- especially if one builds a Tokenizer at the same level of quality as StandardTokenizer.

OTOH, do you mean that the out-of-the-box stuff is indeed not customizable for this task?

regards
Valery



Robert Muir wrote:

Valery,

One thing you could try would be to create a JFlex-based tokenizer, specifying a grammar with the rules you want. You could use the source code & grammar of StandardTokenizer as a starting point.
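Once JFlex has generated a scanner from your modified grammar, wiring it in is the easy part. For example (MyCodeTokenizer is a hypothetical wrapper around the generated scanner, analogous to how StandardTokenizer wraps StandardTokenizerImpl):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class CodeAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // lowercasing keeps the symbols, so "C++" becomes "c++", not "c"
        return new LowerCaseFilter(new MyCodeTokenizer(reader));
      }
    }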


On Thu, Aug 20, 2009 at 10:28 AM, Valery <khame...@gmail.com> wrote:

Hi all,

I am trying to tune Lucene to respect tokens like C++, C#, and .NET.

The task is known in the Lucene community, but surprisingly I can't google up any good info on it.

Of course, I tried to re-use Lucene's building blocks for a Tokenizer. Here we go:

1) StandardTokenizer -- oh, this option would be just fantastic, but "C++, C#, .NET" ends up as "c c net". Too bad. (A tiny demo of this follows the list.)

2) WhitespaceTokenizer gives me a lot of lexemes that should actually have been chopped into smaller pieces. Example: "C/C++" comes out as a single lexeme. If I follow this route, I end up with "tokenization of tokens" -- which sounds a bit odd, doesn't it?

3) CharTokenizer allows me to add '/' as a token-emitting char, but then '/' gets immediately lost, just like the whitespace chars. As a result, "SAP R/3" ends up as "SAP" "R" "3", and one would need to search the original char stream for the "/" char to rebuild the "SAP R/3" term as a whole.
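Here is the demo behind item 1) -- a minimal sketch against the pre-2.9 TokenStream API (the no-arg StandardAnalyzer constructor and the class name are my own choices):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class StandardDemo {
      public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardAnalyzer()
            .tokenStream("f", new StringReader("C++, C#, .NET"));
        // prints "c", "c", "net" -- the ++, # and . are gone
        for (Token t = ts.next(new Token()); t != null; t = ts.next(t)) {
          System.out.println(new String(t.termBuffer(), 0, t.termLength()));
        }
      }
    }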

Do you see any other relevant building blocks that I have missed?

Also, people around here have suggested that such a problem should be solved with a synonym dictionary. However, this hint sheds no light on which tokenization strategy would be appropriate *before* the synonym step.

So, it looks like I have to take the CharTokenizer class as a starting point and write my own Tokenizer. This Tokenizer should also react to delimiting characters and emit tokens. However, it should distinguish between delimiters like whitespace and ";,?" on the one hand and delimiters like "./&" on the other.

Indeed, delimiters like whitespace and ";,?" should be thrown away at the lexeme level, whereas token-emitting characters like "./&" should be kept at the lexeme level. A rough sketch of what I mean follows.
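Concretely, something like this (untested, names are mine; pre-2.9 Token API; the two delimiter classes are just the examples from above):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    public class CodeTokenizer extends Tokenizer {
      private int offset = 0;          // absolute position in the input
      private int pendingChar = -1;    // a kept delimiter waiting to be emitted
      private int pendingOffset;
      private final StringBuilder buf = new StringBuilder();

      public CodeTokenizer(Reader in) { super(in); }

      // whitespace and ";,?" delimit tokens and are thrown away
      private static boolean dropped(int c) {
        return Character.isWhitespace(c) || ";,?".indexOf(c) >= 0;
      }
      // "/", ".", "&" delimit tokens but are emitted as one-char tokens
      private static boolean kept(int c) {
        return "/.&".indexOf(c) >= 0;
      }

      public Token next(Token t) throws IOException {
        t.clear();
        if (pendingChar != -1) {       // flush the delimiter seen last call
          t.setTermBuffer(new char[] { (char) pendingChar }, 0, 1);
          t.setStartOffset(pendingOffset);
          t.setEndOffset(pendingOffset + 1);
          pendingChar = -1;
          return t;
        }
        buf.setLength(0);
        int start = offset;
        int c;
        while ((c = input.read()) != -1) {
          offset++;
          if (dropped(c)) {
            if (buf.length() > 0) break;   // token ends, delimiter vanishes
            start = offset;                // skip leading throw-away chars
          } else if (kept(c)) {
            if (buf.length() > 0) {        // token ends, delimiter is held back
              pendingChar = c;
              pendingOffset = offset - 1;
              break;
            }
            t.setTermBuffer(new char[] { (char) c }, 0, 1);
            t.setStartOffset(offset - 1);
            t.setEndOffset(offset);
            return t;
          } else {
            buf.append((char) c);          // '+' and '#' land here, so "C++" survives
          }
        }
        if (buf.length() == 0) return null;  // end of stream
        char[] term = buf.toString().toCharArray();
        t.setTermBuffer(term, 0, term.length);
        t.setStartOffset(start);
        t.setEndOffset(start + term.length);
        return t;
      }
    }

With this, "SAP R/3" comes out as "SAP" "R" "/" "3". (".NET" would come out as "." "NET", so gluing such pairs back together would be the job of a later filter -- or of the synonym step.)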

Your comments, gurus?

regards,
Valery

--
View this message in context:
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


--
Robert Muir
rcm...@gmail.com


--
View this message in context: 
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
