Valery, FWIW, I think the answer to this question is still "it depends". I agree with John: it is much easier for your tokenizer to create tokens that contain all the context the downstream filters need to do their job. I don't think you can put an exact specification on what this is; it really does depend on a lot of things, mostly on how your application needs to handle text.
For an extreme example, the Tokenizer in Lucene contrib's SmartChineseAnalyzer (SentenceTokenizer) actually outputs entire phrases as tokens. This is because the downstream TokenFilter (WordTokenFilter) needs that kind of context to subdivide the phrases into words (see the rough sketch at the end of this mail).

On Fri, Aug 21, 2009 at 7:45 AM, Valery<khame...@gmail.com> wrote:
>
> Simon Willnauer wrote:
>> you could do the whole job in a Tokenizer but this would not be a good
>> separation of concerns right!?
>
> right, it wouldn't be a good separation of concerns.
> That's why I wanted to know what you consider as "Tokenizer's job".
>

--
Robert Muir
rcm...@gmail.com
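To make that concrete, here is a very rough sketch of the pattern, not the actual SmartChineseAnalyzer code: the class name is made up, it is written against the newer attribute-based TokenStream API (CharTermAttribute), and offsets, positions, and the real segmentation logic are left out for brevity. The tokenizer hands whole phrases downstream as single tokens, and a filter like this re-segments each one into word tokens:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Rough sketch only (not the real WordTokenFilter): splits each
 * phrase-sized token coming from the tokenizer into word tokens.
 * It can only do this because the tokenizer kept the whole phrase
 * together in one token.
 */
public final class PhraseSplittingFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final LinkedList<String> pending = new LinkedList<String>();

  public PhraseSplittingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // emit words buffered from the previous phrase first
    // (offsets are left pointing at the whole phrase for brevity)
    if (!pending.isEmpty()) {
      termAtt.setEmpty().append(pending.removeFirst());
      return true;
    }
    // pull the next phrase-sized token from the upstream tokenizer
    if (!input.incrementToken()) {
      return false;
    }
    // placeholder segmentation: the real thing would run a
    // dictionary/statistical segmenter over the whole phrase here
    for (String word : termAtt.toString().split("\\s+")) {
      if (word.length() > 0) {
        pending.add(word);
      }
    }
    return incrementToken();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
  }
}

The point is just the division of labor: the tokenizer decides how much surrounding text ends up in a token, and the filter relies on that context to make its decisions.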