Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

Robert Muir Thu, 20 Aug 2009 11:48:45 -0700

Valery, I think it all depends on how you want your search to work.

By this I mean, for example: if a document contains only "C++",
do you want a search for just "C" to match it or not?
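
To make the question concrete, here is a minimal sketch in plain Python
(not Lucene code). Both tokenizers and the matches() helper are
hypothetical, made up only to show how the analysis choice decides
whether "C" finds a document that contains "C++":

```python
import re

def tokenize_alnum_only(text):
    # Hypothetical analyzer A: keep runs of letters/digits, drop symbols.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def tokenize_whitespace(text):
    # Hypothetical analyzer B: split on whitespace, so "C++" stays whole.
    return [t.lower() for t in text.split()]

def matches(query, document, tokenize):
    # A query matches when every query token occurs in the document.
    doc_tokens = set(tokenize(document))
    return all(q in doc_tokens for q in tokenize(query))

doc = "Experienced in C++"
print(matches("C", doc, tokenize_alnum_only))    # True
print(matches("C", doc, tokenize_whitespace))    # False
print(matches("C++", doc, tokenize_whitespace))  # True
```

With analyzer A, "C" and "C++" become the same term, so they can no
longer be told apart; with analyzer B they are distinct terms.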


Another thing I would suggest is to take a look at the capabilities of
Solr: it has some analysis components that might be useful for your
needs.
The wiki page is here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
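
For reference, the approach proposed in the quoted message below (emit
atomic tokens first, then let downstream filters rebuild compounds) can
be sketched roughly like this. It is plain Python, not the Lucene or
Solr API; the COMPOUNDS table and both function names are assumptions
made up for illustration. Unlike the "SAP R/3" example in the quote,
this sketch emits only the merged token and does not keep the parts.

```python
import re

# Hypothetical compound table: atomic token sequence -> merged token.
COMPOUNDS = {
    ("C", "+", "+"): "C++",
    ("C", "#"): "C#",
    ("R", "/", "3"): "R/3",
}
MAX_LEN = max(len(key) for key in COMPOUNDS)

def atomic_tokens(text):
    # Runs of letters/digits, or any single non-space symbol.
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

def merge_compounds(tokens):
    # Greedy, longest-first merge of known compounds.
    # Symbols that are not part of a compound are dropped.
    out, i = [], 0
    while i < len(tokens):
        for n in range(MAX_LEN, 1, -1):
            key = tuple(tokens[i:i + n])
            if key in COMPOUNDS:
                out.append(COMPOUNDS[key])
                i += len(key)
                break
        else:
            if tokens[i].isalnum():
                out.append(tokens[i])
            i += 1
    return out

print(atomic_tokens("C/C++"))                     # ['C', '/', 'C', '+', '+']
print(merge_compounds(atomic_tokens("C/C++")))    # ['C', 'C++']
print(merge_compounds(atomic_tokens("SAP R/3")))  # ['SAP', 'R/3']
```

Note that the atomic stage discards whitespace, so "C + +" would also
merge into "C++" here; a real filter would probably need to check token
offsets to avoid that.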


On Thu, Aug 20, 2009 at 1:46 PM, Valery<[email protected]> wrote:
>
> Hi Robert,
>
> so, would you expect a Tokenizer to consider '/' in both cases as a separate
> Token?
>
> Personally, I see no problem if the Tokenizer did the following job:
>
> "C/C++" ==> TokenStream of { "C", "/", "C", "+", "+"}
> and come up with "C" and "C++" tokens after processing through the
> downstream tokenfilters.
>
> Similarly:
>
> "SAP R/3" ==> TokenStream of { "SAP", "R", "/", "3"}
> and getting { "SAP", "R", "/", "3", "R/3", "SAP R/3"} later.
>
> I try to follow the principle that a token (or its lexeme) should usually
> never be parsed again. One can build more complex (compound) things from the
> tokens. However, one usually never chops a lexeme into smaller pieces.
>
> What do you think, Robert?
>
> regards,
> Valery
>
> --
> View this message in context: 
> http://www.nabble.com/Any-Tokenizator-friendly-to-C%2B%2B%2C-C-%2C-.NET%2C-etc---tp25063175p25066762.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>



-- 
Robert Muir
[email protected]

