Hi Otis,

I've implemented sentence detection as part of my tokenizer. It does not extract sentences; it "detects" EOS (end of sentence), based on several characters from the Unicode spec. Upon detection, it returns a Token of type EOS. I then have an EOS filter that can be configured with what to do with that token: for example, set posIncr to 100 on the next token, so that phrase/fuzzy searches do not find matches across sentences. There are other uses as well, such as highlighting.

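To illustrate the approach described above, here is a minimal sketch in plain Java. It is not the actual implementation and deliberately does not use Lucene's TokenStream API; the names EosSketch, Token, isEos, tokenize and eosFilter are hypothetical, and the set of EOS characters is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy tokenizer that emits EOS marker tokens, plus a
// filter that turns each marker into a position gap on the next word.
final class EosSketch {
    enum Type { WORD, EOS }

    static final class Token {
        final String text;
        final Type type;
        final int posIncr; // position increment relative to the previous token

        Token(String text, Type type, int posIncr) {
            this.text = text;
            this.type = type;
            this.posIncr = posIncr;
        }
    }

    // Assumed EOS characters: ASCII terminators and the ideographic full stop.
    // A real detector would cover more of Unicode and handle abbreviations.
    static boolean isEos(char c) {
        return c == '.' || c == '!' || c == '?' || c == '\u3002';
    }

    // Splits on non-alphanumeric characters; emits an EOS token when it
    // sees one of the assumed sentence terminators.
    static List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
                continue;
            }
            if (word.length() > 0) {
                out.add(new Token(word.toString(), Type.WORD, 1));
                word.setLength(0);
            }
            if (isEos(c)) {
                out.add(new Token(String.valueOf(c), Type.EOS, 1));
            }
        }
        if (word.length() > 0) {
            out.add(new Token(word.toString(), Type.WORD, 1));
        }
        return out;
    }

    // One possible filter behavior: drop EOS tokens and give the first word
    // after each one a large position increment (sentenceGap).
    static List<Token> eosFilter(List<Token> in, int sentenceGap) {
        List<Token> out = new ArrayList<>();
        boolean pendingGap = false;
        for (Token t : in) {
            if (t.type == Type.EOS) {
                pendingGap = true;
                continue;
            }
            out.add(new Token(t.text, t.type, pendingGap ? sentenceGap : t.posIncr));
            pendingGap = false;
        }
        return out;
    }
}
```

Calling EosSketch.eosFilter(EosSketch.tokenize(text), 100) yields word tokens only, with the first word of each following sentence placed 100 positions after the previous word. Keeping detection in the tokenizer and the policy in a separate filter means other consumers (highlighting, for instance) could react to the same EOS tokens differently.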

So personally, I would not think of EOS detection as a Tokenizer in and of itself, but rather as a capability of a Tokenizer (Standard?).

Shai

On Fri, Nov 27, 2009 at 8:07 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> The contrib/wordnet package contains an AnalyzerUtil class with a method
> that extracts sentences from text/String. It is super-simplistic, so
> probably not very accurate, but I am wondering if *conceptually* it would
> make sense to have a Tokenizer that extracts sentences? I suppose that
> means each Token would be a complete sentence.
>
> Would you say it makes sense to implement sentence detection/extraction as
> a Tokenizer?
>
> Thanks,
> Otis