Valery, FWIW, I think the answer to this question is still "it depends". I agree with John: it is much easier for your tokenizer to create tokens that contain all the context the downstream filters need to do their job. I don't think you can put an exact specification on what this is; it really does depend on a lot of things, mostly on how your application needs to handle text.
For an extreme example, the Tokenizer in Lucene contrib's SmartChineseAnalyzer (SentenceTokenizer) actually outputs entire phrases as tokens. This is because the downstream TokenFilter (WordTokenFilter) needs that kind of context to subdivide the phrases into words (see the rough sketch at the end of this mail).

On Fri, Aug 21, 2009 at 7:45 AM, Valery<khame...@gmail.com> wrote:
>
> Simon Willnauer wrote:
>> you could do the whole job in a Tokenizer but this would not be a good
>> separation of concerns right!?
>
> right, it wouldn't be a good separation of concerns.
> That's why I wanted to know what you consider as "Tokenizer's job".
>

--
Robert Muir
rcm...@gmail.com
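To make that concrete, here is a very rough sketch of the pattern, not the actual SmartChineseAnalyzer code: the class name is made up, it is written against the newer attribute-based TokenStream API (CharTermAttribute), and offsets, positions, and the real segmentation logic are left out for brevity. The tokenizer hands whole phrases downstream as single tokens, and a filter like this re-segments each one into word tokens:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Rough sketch only (not the real WordTokenFilter): splits each
 * phrase-sized token coming from the tokenizer into word tokens.
 * It can only do this because the tokenizer kept the whole phrase
 * together in one token.
 */
public final class PhraseSplittingFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final LinkedList<String> pending = new LinkedList<String>();

  public PhraseSplittingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // emit words buffered from the previous phrase first
    // (offsets are left pointing at the whole phrase for brevity)
    if (!pending.isEmpty()) {
      termAtt.setEmpty().append(pending.removeFirst());
      return true;
    }
    // pull the next phrase-sized token from the upstream tokenizer
    if (!input.incrementToken()) {
      return false;
    }
    // placeholder segmentation: the real thing would run a
    // dictionary/statistical segmenter over the whole phrase here
    for (String word : termAtt.toString().split("\\s+")) {
      if (word.length() > 0) {
        pending.add(word);
      }
    }
    return incrementToken();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

The point is just the division of labor: the tokenizer decides how much surrounding text ends up in a token, and the filter relies on that context to make its decisions.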