Hi Robert,
So, would you expect a Tokenizer to treat '/' as a separate token in both
cases?
Personally, I see no problem if the Tokenizer did the following:
"C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"}
and the "C" and "C++" tokens were then produced by the downstream
TokenFilters.
Similarly:
"SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"}
with the filters later yielding { "SAP", "R", "/", "3", "R/3", "SAP R/3"}.
I try to follow the principle that a token (or its lexeme) should normally
never be parsed again. One can build more complex (compound) things from the
tokens, but one usually never chops a lexeme into smaller pieces.
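To make the idea concrete, here is a rough sketch in plain Java. It deliberately does not use the Lucene API: the class CompoundTokenSketch, the methods tokenize and mergeCompounds, and the hard-coded compound list are all names I made up for illustration. Stage 1 only splits the text into atomic tokens; stage 2 builds compounds from those tokens and never re-parses them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's Tokenizer/TokenFilter API.
class CompoundTokenSketch {

    // Stage 1: runs of letters/digits form one token; every other
    // non-whitespace character becomes a token of its own.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // Stage 2: replace known sequences of atomic tokens with one compound
    // token, preferring the longest match at each position.
    static List<String> mergeCompounds(List<String> tokens,
                                       List<List<String>> compounds) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            List<String> match = null;
            for (List<String> c : compounds) {
                int end = i + c.size();
                if (end <= tokens.size()
                        && tokens.subList(i, end).equals(c)
                        && (match == null || c.size() > match.size())) {
                    match = c;
                }
            }
            if (match != null) {
                out.add(String.join("", match));
                i += match.size();
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed compound dictionary, chosen to match the two examples.
        List<List<String>> compounds = List.of(
                List.of("C", "+", "+"),
                List.of("R", "/", "3"));
        System.out.println(mergeCompounds(tokenize("C/C++"), compounds));
        System.out.println(mergeCompounds(tokenize("SAP R/3"), compounds));
    }
}
```

Note that this sketch replaces the atomic tokens with the compound instead of emitting both side by side, and it leaves the stand-alone "/" in the stream. Keeping the originals, or dropping punctuation, would be the job of further filters in the chain.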
What do you think, Robert?
regards,
Valery
--
View this message in context:
http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066762.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.