Thanks, that is understood.
My application is a bit special in that I need both an indexed
field with standard tokenization and a stored but unindexed field of
sentences. Both must be present for each document.
I could possibly make do with PatternTokenizer, but that is of course
less accurate than, e.g., wrapping the OpenNLP sentence splitter in a
Lucene Tokenizer.
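[Editor's note: for reference, a minimal sketch of such a wrapper, assuming
opennlp-tools on the classpath and a pre-trained sentence model such as
en-sent.bin; the class name SentenceTokenizer is illustrative, not an
existing Lucene class.]

import java.io.IOException;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.util.Span;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/**
 * Emits one token per sentence, using an OpenNLP SentenceDetectorME.
 * Reads the whole input up front, which is fine for document-sized
 * fields but not for huge streams.
 */
public final class SentenceTokenizer extends Tokenizer {

  // e.g. new SentenceDetectorME(new SentenceModel(new FileInputStream("en-sent.bin")))
  private final SentenceDetectorME detector;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private String text;
  private Span[] sentences;
  private int index;

  public SentenceTokenizer(SentenceDetectorME detector) {
    this.detector = detector;
  }

  @Override
  public boolean incrementToken() throws IOException {
    clearAttributes();
    if (text == null) {
      // First call after reset(): read the whole field and detect sentences.
      text = slurp();
      sentences = detector.sentPosDetect(text);
      index = 0;
    }
    if (index >= sentences.length) {
      return false;
    }
    Span s = sentences[index++];
    termAtt.append(text, s.getStart(), s.getEnd());
    offsetAtt.setOffset(correctOffset(s.getStart()), correctOffset(s.getEnd()));
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    text = null;
    sentences = null;
    index = 0;
  }

  @Override
  public void end() throws IOException {
    super.end();
    int end = text == null ? 0 : text.length();
    offsetAtt.setOffset(correctOffset(end), correctOffset(end));
  }

  private String slurp() throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[8192];
    int n;
    while ((n = input.read(buf)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }
}

[The analyzed field would keep its standard tokenizer as usual; the sentence
output could go into a separate stored-only field, as sketched at the end of
the thread.]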
On 23/09/2015 16:23, Doug Turnbull wrote:
Sentence recognition is usually an NLP problem, probably best handled
outside of Solr. For example, you probably want to train and run a sentence
recognition algorithm, inject a sentence delimiter, then use that delimiter
as the basis for tokenization.
More info on sentence recognition:
http://opennlp.apache.org/documentation/manual/opennlp.html
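[Editor's note: a small sketch of that delimiter-injection step, assuming
OpenNLP's pre-trained English sentence model; the class name
SentenceDelimiter and the model path are assumptions for illustration.]

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public final class SentenceDelimiter {

  // U+2029 (PARAGRAPH SEPARATOR) rarely occurs in ordinary text, so it is
  // a reasonably safe boundary marker to split on later.
  public static final char DELIM = '\u2029';

  /** Re-joins the detected sentences with the delimiter injected between them. */
  public static String delimit(SentenceDetectorME detector, String text) {
    return String.join(String.valueOf(DELIM), detector.sentDetect(text));
  }

  public static void main(String[] args) throws Exception {
    // en-sent.bin is OpenNLP's pre-trained English sentence model.
    try (InputStream in = new FileInputStream("en-sent.bin")) {
      SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));
      System.out.println(delimit(detector, "Dr. Smith arrived. He sat down."));
    }
  }
}

[The delimited text could then be tokenized by splitting on that character,
e.g. with PatternTokenizer and the pattern \u2029.]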
On Wed, Sep 23, 2015 at 11:18 AM, Ziqi Zhang <ziqi.zh...@sheffield.ac.uk>
wrote:
Hi
I need a special kind of 'token', which is a sentence, so I need a
tokenizer that splits text into sentences.
I wonder whether such or similar implementations already exist?
If I have to implement it myself, I suppose I need to subclass
Tokenizer. Having looked at a few existing implementations, I find it is
not obvious how to do so. A few pointers would be highly appreciated.
Many thanks
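[Editor's note: tying the thread together, a minimal sketch of the two-field
document layout described in the reply at the top; the field names are
hypothetical, and the hard-coded sentences stand in for real
sentence-splitter output.]

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public final class TwoFieldDocDemo {
  public static void main(String[] args) {
    String text = "First sentence. Second sentence.";
    // Stand-in for the output of a sentence splitter such as OpenNLP's.
    String[] sentences = { "First sentence.", "Second sentence." };

    Document doc = new Document();
    // Indexed with whatever analyzer the IndexWriter is configured with
    // (standard tokenization); not stored.
    doc.add(new TextField("content", text, Field.Store.NO));
    // Stored verbatim for retrieval, never indexed; one value per sentence.
    for (String s : sentences) {
      doc.add(new StoredField("sentence", s));
    }
    System.out.println(doc);
  }
}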
--
Ziqi Zhang
Research Associate
Department of Computer Science
University of Sheffield