Valery, I think it all depends on how you want your search to work. By this I mean, for example: if a document contains only "C++", do you want a search for just "C" to match or not?
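To make the question concrete, here is a rough sketch in plain Java. It does not use any Lucene or Solr classes; `TokenSketch`, `fineGrained`, and `mergePlus` are made-up names for illustration only, and the merge step is an assumption about what a downstream filter might do, not an existing filter.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSketch {

    // Split text into runs of letters/digits and single punctuation characters.
    // Whitespace only separates tokens and is dropped.
    public static List<String> fineGrained(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // Hypothetical downstream step: glue each "+" onto the token before it
    // and drop every other punctuation-only token (such as "/").
    // Simplification: token positions are not tracked, so "a + b" would
    // also be merged into "a+" and "b".
    public static List<String> mergePlus(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.equals("+") && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + "+");
            } else if (Character.isLetterOrDigit(t.charAt(0))) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = fineGrained("C++");
        List<String> merged = mergePlus(raw);
        // Indexing the raw tokens lets a query for "C" hit this document;
        // indexing only the merged token does not.
        System.out.println(raw + " contains C: " + raw.contains("C"));
        System.out.println(merged + " contains C: " + merged.contains("C"));
    }
}
```

If the index holds the fine-grained tokens, a search for "C" matches a document that contains only "C++". If it holds only the merged token, it does not. Which of the two you want is the decision that drives the analyzer design.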
Another thing I would suggest is to take a look at the capabilities of Solr: it has some analysis components that might be useful for your needs. The wiki page is here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

On Thu, Aug 20, 2009 at 1:46 PM, Valery<khame...@gmail.com> wrote:
>
> Hi Robert,
>
> so, would you expect a Tokenizer to consider '/' in both cases as a separate
> Token?
>
> Personally, I see no problem if Tokenizer would do the following job:
>
> "C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"}
> and come up with "C" and "C++" tokens after processing through the
> downstream tokenfilters.
>
> Similarly:
>
> "SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"}
> and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.
>
> I try to follow a spirit that a token (or its lexeme) usually should never be
> parsed again. One can build more complex (compound) things from the tokens.
> However, usually one never chops a lexeme into smaller pieces.
>
> What do you think, Robert?
>
> regards,
> Valery
>
> --
> View this message in context:
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066762.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

--
Robert Muir
rcm...@gmail.com