I'm sorry - I didn't mention that my intention is to have linguistic annotations like stems and maybe part-of-speech information. Tokenization is certainly one of the things I want to do.
2017-05-29 19:02 GMT+02:00 Robert Muir <rcm...@gmail.com>:

> On Mon, May 29, 2017 at 8:36 AM, Christian Becker
> <christian.frei...@gmail.com> wrote:
> > Hi There,
> >
> > I'm new to Lucene (in fact I'm interested in Elasticsearch, but in this
> > case it's related to Lucene) and I want to make some experiments with
> > some enhanced analyzers.
> >
> > Indeed I have an external linguistic component which I want to connect
> > to Lucene / Elasticsearch. So before I produce a bunch of useless code,
> > I want to make sure that I'm going the right way.
> >
> > The linguistic component needs at least a whole sentence as input (at
> > best it would be the whole text at once).
> >
> > So as far as I can see I would need to create a custom Analyzer and
> > override "createComponents" and "normalize".
>
> There is a base class for tokenizers that want to see sentences-at-a-time
> in order to divide into words:
>
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java#L197-L201
>
> There are two examples that use it in the test class:
>
> https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestSegmentingTokenizerBase.java#L145
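To make the suggestion above concrete, here is a minimal sketch of a `SegmentingTokenizerBase` subclass plus the custom Analyzer wrapping it. This is an assumption-laden illustration, not tested code: it follows the contract shown in `TestSegmentingTokenizerBase` (override `setNextSentence` and `incrementWord`; the base class drives a sentence `BreakIterator` over the protected `buffer`), and the class names `LinguisticSentenceTokenizer` and `LinguisticAnalyzer` are made up. The whitespace word-splitting in `incrementWord` is only a placeholder for wherever the external linguistic component would plug in.

```java
import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

// Hypothetical tokenizer: the base class feeds us one sentence at a time,
// which is exactly what the external linguistic component needs as input.
public class LinguisticSentenceTokenizer extends SegmentingTokenizerBase {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private int sentenceEnd; // end of the current sentence within `buffer`
  private int wordStart;   // cursor for emitting words from the sentence

  public LinguisticSentenceTokenizer() {
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    this.sentenceEnd = sentenceEnd;
    this.wordStart = sentenceStart;
    // This is the hook where the whole sentence could be handed to the
    // external component, e.g.:
    //   analysis = linguisticComponent.analyze(
    //       new String(buffer, sentenceStart, sentenceEnd - sentenceStart));
  }

  @Override
  protected boolean incrementWord() {
    // Placeholder segmentation: split the sentence on whitespace. A real
    // implementation would instead emit the tokens (with stems / POS payloads)
    // returned by the external component for this sentence.
    while (wordStart < sentenceEnd && Character.isWhitespace(buffer[wordStart])) {
      wordStart++;
    }
    if (wordStart >= sentenceEnd) {
      return false; // sentence exhausted; base class moves to the next one
    }
    int wordEnd = wordStart;
    while (wordEnd < sentenceEnd && !Character.isWhitespace(buffer[wordEnd])) {
      wordEnd++;
    }
    clearAttributes();
    termAtt.copyBuffer(buffer, wordStart, wordEnd - wordStart);
    offsetAtt.setOffset(correctOffset(offset + wordStart),
                        correctOffset(offset + wordEnd));
    wordStart = wordEnd;
    return true;
  }
}

// The custom Analyzer then only needs createComponents to wire the
// tokenizer in (plus any TokenFilters for normalization).
class LinguisticAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new LinguisticSentenceTokenizer());
  }
}
```

The key design point is that sentence detection stays in the base class (via `BreakIterator`), so the subclass only decides how to turn each sentence into tokens, which maps well onto a sentence-at-a-time linguistic component.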