Hi Valery,

From our experience at Krugle, we wound up having to create our own tokenizers (actually a kind of specialized parser) for the different languages. It didn't seem like a good option to try to twist one of the existing tokenizers into something that would work well enough. We wound up using ANTLR for this.
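In rough outline -- with made-up names, this isn't our actual code -- each ANTLR-generated lexer got wrapped in a Lucene Tokenizer along these lines:

    import java.io.IOException;
    import java.io.Reader;
    import org.antlr.runtime.ANTLRReaderStream;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    public class AntlrCodeTokenizer extends Tokenizer {
      // CppLexer is a stand-in for whatever lexer ANTLR
      // generates from your per-language grammar
      private final CppLexer lexer;

      public AntlrCodeTokenizer(Reader in) throws IOException {
        super(in);
        lexer = new CppLexer(new ANTLRReaderStream(in));
      }

      public Token next(Token reusable) throws IOException {
        org.antlr.runtime.Token t = lexer.nextToken();
        if (t == null || t.getType() == org.antlr.runtime.Token.EOF)
          return null;                       // end of input
        reusable.clear();
        char[] term = t.getText().toCharArray();
        reusable.setTermBuffer(term, 0, term.length);
        return reusable;
      }
    }

The nice part is that the grammar, not the Tokenizer, decides that "C++" or "R/3" is one token.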

-- Ken


On Aug 20, 2009, at 8:09am, Valery wrote:


Hi Robert,

thanks for the hint.

Indeed, that's a natural way to go -- especially if one builds a Tokenizer at the same level of quality as StandardTokenizer.

OTOH, do you mean that the out-of-the-box stuff is indeed not customizable for this task?

regards
Valery



Robert Muir wrote:

Valery,

One thing you could try would be to create a JFlex-based tokenizer, specifying a grammar with the rules you want. You could use the source code & grammar of StandardTokenizer as a starting point.
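Once JFlex has generated a scanner from your modified grammar, wiring it in is the easy part. For example (MyCodeTokenizer is a hypothetical wrapper around the generated scanner, analogous to how StandardTokenizer wraps StandardTokenizerImpl):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class CodeAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // lowercasing keeps the symbols, so "C++" becomes "c++", not "c"
        return new LowerCaseFilter(new MyCodeTokenizer(reader));
      }
    }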


On Thu, Aug 20, 2009 at 10:28 AM, Valery <khame...@gmail.com> wrote:

Hi all,

I am trying to tune Lucene to respect tokens like C++, C#, and .NET.

The task is known in the Lucene community, but surprisingly I can't google up any good info on it.

Of course, I tried to re-use Lucene's building blocks for a Tokenizer. Here we go:

1) StandardTokenizer -- oh, this option would be just fantastic, but "C++, C#, .NET" ends up as "c c net". Too bad. (A tiny demo of this follows the list.)

2) WhitespaceTokenizer gives me a lot of lexemes that should actually have been chopped into smaller pieces. Example: "C/C++" comes out as a single lexeme. If I follow this route, I end up with "tokenization of tokens" -- which sounds a bit odd, doesn't it?

3) CharTokenizer allows me to add '/' as a token-emitting char, but then '/' gets immediately lost, just like the whitespace chars. As a result, "SAP R/3" ends up as "SAP" "R" "3", and one would need to search the original char stream for the "/" char to rebuild the "SAP R/3" term as a whole.
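Here is the demo behind item 1) -- a minimal sketch against the pre-2.9 TokenStream API (the no-arg StandardAnalyzer constructor and the class name are my own choices):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class StandardDemo {
      public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardAnalyzer()
            .tokenStream("f", new StringReader("C++, C#, .NET"));
        // prints "c", "c", "net" -- the ++, # and . are gone
        for (Token t = ts.next(new Token()); t != null; t = ts.next(t)) {
          System.out.println(new String(t.termBuffer(), 0, t.termLength()));
        }
      }
    }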

Do you see any other relevant building blocks that I have missed?

Also, people around here have suggested that such a problem should be solved with a synonym dictionary. However, this hint sheds no light on which tokenization strategy would be appropriate *before* the synonym step.

So, it looks like I have to take the CharTokenizer class as a starting point and write my own Tokenizer. This Tokenizer should also react to delimiting characters and emit tokens. However, it should distinguish between delimiters like whitespace and ";,?" on the one hand and delimiters like "./&" on the other.

Indeed, delimiters like whitespace and ";,?" should be thrown away at the lexeme level, whereas token-emitting characters like "./&" should be kept at the lexeme level. A rough sketch of what I mean follows.
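Concretely, something like this (untested, names are mine; pre-2.9 Token API; the two delimiter classes are just the examples from above):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    public class CodeTokenizer extends Tokenizer {
      private int offset = 0;          // absolute position in the input
      private int pendingChar = -1;    // a kept delimiter waiting to be emitted
      private int pendingOffset;
      private final StringBuilder buf = new StringBuilder();

      public CodeTokenizer(Reader in) { super(in); }

      // whitespace and ";,?" delimit tokens and are thrown away
      private static boolean dropped(int c) {
        return Character.isWhitespace(c) || ";,?".indexOf(c) >= 0;
      }
      // "/", ".", "&" delimit tokens but are emitted as one-char tokens
      private static boolean kept(int c) {
        return "/.&".indexOf(c) >= 0;
      }

      public Token next(Token t) throws IOException {
        t.clear();
        if (pendingChar != -1) {       // flush the delimiter seen last call
          t.setTermBuffer(new char[] { (char) pendingChar }, 0, 1);
          t.setStartOffset(pendingOffset);
          t.setEndOffset(pendingOffset + 1);
          pendingChar = -1;
          return t;
        }
        buf.setLength(0);
        int start = offset;
        int c;
        while ((c = input.read()) != -1) {
          offset++;
          if (dropped(c)) {
            if (buf.length() > 0) break;   // token ends, delimiter vanishes
            start = offset;                // skip leading throw-away chars
          } else if (kept(c)) {
            if (buf.length() > 0) {        // token ends, delimiter is held back
              pendingChar = c;
              pendingOffset = offset - 1;
              break;
            }
            t.setTermBuffer(new char[] { (char) c }, 0, 1);
            t.setStartOffset(offset - 1);
            t.setEndOffset(offset);
            return t;
          } else {
            buf.append((char) c);          // '+' and '#' land here, so "C++" survives
          }
        }
        if (buf.length() == 0) return null;  // end of stream
        char[] term = buf.toString().toCharArray();
        t.setTermBuffer(term, 0, term.length);
        t.setStartOffset(start);
        t.setEndOffset(start + term.length);
        return t;
      }
    }

With this, "SAP R/3" comes out as "SAP" "R" "/" "3". (".NET" would come out as "." "NET", so gluing such pairs back together would be the job of a later filter -- or of the synonym step.)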

Your comments, gurus?

regards,
Valery

--
View this message in context:
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063175.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


--
Robert Muir
rcm...@gmail.com


--
View this message in context: 
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25063964.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
