Hi Valery,
From our experience at Krugle, we wound up having to create our own
tokenizers (actually kind of specialized parser) for the different
languages. It didn't seem like a good option to try to twist one of
the existing tokenizers into something that would work well enough. We
wound up using ANTLR for this.
-- Ken
On Aug 20, 2009, at 8:09am, Valery wrote:
Hi Robert,
thanks for the hint.
Indeed, a natural way to go. Especially if one builds a Tokenizer of
the
level of quality like StandardTokenizer's.
OTOH, you mean that the out-of-the-box stuff is indeed not
customizable for
this task?..
regards
Valery
Robert Muir wrote:
Valery,
One thing you could try would be to create a JFlex-based tokenizer,
specifying a grammar with the rules you want.
You could use the source code & grammar of StandardTokenizer as a
starting point.
On Thu, Aug 20, 2009 at 10:28 AM, Valery<khame...@gmail.com> wrote:
Hi all,
I am trying to tune Lucene to respect such tokens like C++, C#, .NET
The task is known for Lucene community, but surprisingly I can't
google
out
somewhat good info on it.
Of course, I tried to re-use Lucene's building blocks for
Tokenizer.
Here
we go:
1) StandardTokenizer -- oh, this option would be just fantastic,
but
"C++,
C#, .NET" ends up with "c c net". Too bad.
2) WhitespaceTokenizer gives me a lot of lexems that are actually
should
have been chopped into smaller pieces. Example: "C/C++" comes out
like a
single lexem. If I follow this way I end-up with "Tokenization of
tokens"
--
that sounds a bit odd, doesn't it?
3) CharTokenizer allows me to add the '/' to be also a token-
emitting
char, but then '/' gets immediately lost like those whitespace
chars. In
result "SAP R/3" ends up with "SAP" "R" "3" and one will need to
search
the
original char stream for the "/" char to re-build "SAP R/3" term
as a
whole.
Do you see any other relevant building blocks missed by me?
Also, people around there have meant that such problem should be
solved
by a
synonym dictionary. However this hint sheds no light on which
tokenization
strategy should be more appropriate *before* the synonym step.
So, it looks like I have to take the class CharTokenizer as for the
starting
point and write anew my own Tokenizer. This Tokenizer should also
react
on
delimiting characters and emit the token. However, it should
distinguish
between delimiters like whitespaces along with ";,?" and the
delimiters
like
"./&".
Indeed, the delimiters like whitespaces and ";,?" should be thrown
away
from
Lexem level,
whereas the token emitting characters like "./&" should be kept in
Lexem
level.
Your comments, gurus?
regards,
Valery
--
View this message in context:
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
Sent from the Lucene - Java Users mailing list archive at
Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--
Robert Muir
rcm...@gmail.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--
View this message in context:
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org