If you used our products, which have Elastic plugins, POS tagging, stemming and lemmatisation, it would be much easier.
Kind Regards
Chris
VP International
E: cbr...@basistech.com
T: +44 208 622 2900
M: +44 7796946934
USA Number: +16173867107
Lakeside House, 1 Furzeground Way, Stockley Park, Middlesex, UB11 1BD, UK

On 29 May 2017 at 19:42, Christian Becker <christian.frei...@gmail.com> wrote:
> I'm sorry - I didn't write down that my intention is to have linguistic
> annotations like stems and maybe part-of-speech information. For sure,
> tokenization is one of the things I want to do.
>
> 2017-05-29 19:02 GMT+02:00 Robert Muir <rcm...@gmail.com>:
>
> > On Mon, May 29, 2017 at 8:36 AM, Christian Becker
> > <christian.frei...@gmail.com> wrote:
> > > Hi There,
> > >
> > > I'm new to Lucene (in fact I'm interested in ElasticSearch, but in this
> > > case it's related to Lucene) and I want to make some experiments with
> > > some enhanced analyzers.
> > >
> > > Indeed I have an external linguistic component which I want to connect
> > > to Lucene / ElasticSearch. So before I produce a bunch of useless
> > > code, I want to make sure that I'm going the right way.
> > >
> > > The linguistic component needs at least a whole sentence as input (at
> > > best it would be the whole text at once).
> > >
> > > So as far as I can see, I would need to create a custom Analyzer and
> > > override "createComponents" and "normalize".
> > >
> >
> > There is a base class for tokenizers that want to see
> > sentences-at-a-time in order to divide into words:
> >
> > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/SegmentingTokenizerBase.java#L197-L201
> >
> > There are two examples that use it in the test class:
> >
> > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/test/org/apache/lucene/analysis/util/TestSegmentingTokenizerBase.java#L145
> >
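For reference, below is a minimal sketch of what the suggestion in the thread could look like: a SegmentingTokenizerBase subclass that receives one sentence at a time, wired into a custom Analyzer via createComponents. The class names are placeholders, the word-splitting inside incrementWord is a trivial whitespace split standing in for the external linguistic component, and the stems/POS output is only indicated in comments - it is not a real API from that component.

import java.text.BreakIterator;
import java.util.Locale;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

/**
 * Sketch: a tokenizer that is fed one sentence at a time by the base class,
 * so a whole sentence could be handed to an external linguistic component.
 */
class SentenceLinguisticsTokenizer extends SegmentingTokenizerBase {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  private int sentenceStart, sentenceEnd;
  private int wordStart; // cursor for the placeholder whitespace split below

  SentenceLinguisticsTokenizer() {
    // The base class uses this BreakIterator to find sentence boundaries
    // and then calls setNextSentence(...) once per sentence.
    super(BreakIterator.getSentenceInstance(Locale.ROOT));
  }

  @Override
  protected void setNextSentence(int sentenceStart, int sentenceEnd) {
    // The current sentence is buffer[sentenceStart..sentenceEnd).
    // This is the point where the whole sentence could be passed to the
    // external component (placeholder: we just remember the bounds).
    this.sentenceStart = sentenceStart;
    this.sentenceEnd = sentenceEnd;
    this.wordStart = sentenceStart;
  }

  @Override
  protected boolean incrementWord() {
    // Emit the words of the current sentence one at a time. Placeholder
    // logic: split on whitespace; a real implementation would emit the
    // tokens (plus stem/POS annotations) produced by the component.
    while (wordStart < sentenceEnd && Character.isWhitespace(buffer[wordStart])) {
      wordStart++;
    }
    if (wordStart >= sentenceEnd) {
      return false; // sentence exhausted; the base class advances to the next one
    }
    int wordEnd = wordStart;
    while (wordEnd < sentenceEnd && !Character.isWhitespace(buffer[wordEnd])) {
      wordEnd++;
    }
    clearAttributes();
    termAtt.copyBuffer(buffer, wordStart, wordEnd - wordStart);
    offsetAtt.setOffset(correctOffset(offset + wordStart), correctOffset(offset + wordEnd));
    wordStart = wordEnd;
    return true;
  }
}

/** Wiring the tokenizer into a custom Analyzer, as discussed in the thread. */
class SentenceLinguisticsAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new SentenceLinguisticsTokenizer();
    // TokenFilters carrying the stem/POS annotations could be chained here.
    return new TokenStreamComponents(source);
  }
}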