Hi,

as you are using Elasticsearch, there is no need to implement an Analyzer
subclass yourself. In fact, this is almost never needed in plain Lucene
either, as the class CustomAnalyzer uses a builder pattern to construct an
analyzer from factory names, just like Elasticsearch and Solr do.
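To illustrate, a minimal sketch of the CustomAnalyzer builder (the factory
names "standard" and "lowercase" are the standard Lucene SPI names; the class
and field names around them are made up for this example):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomAnalyzerDemo {
    // Build an analyzer declaratively from factory names,
    // the same way Elasticsearch and Solr assemble theirs from config.
    static Analyzer buildAnalyzer() throws IOException {
        return CustomAnalyzer.builder()
            .withTokenizer("standard")   // factory name, resolved via SPI
            .addTokenFilter("lowercase")
            .build();
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = buildAnalyzer();
             TokenStream ts = analyzer.tokenStream("field", "Hello World")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints "hello" then "world"
            }
            ts.end();
        }
    }
}
```

This is usually all you need as long as your components are registered as
factories; only genuinely new analysis logic requires writing new classes.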

For your use case you need to implement a custom Tokenizer and/or several
TokenFilters. In addition, you need to create the corresponding factory
classes and bundle everything as an Elasticsearch plugin; I'd suggest asking
on the Elasticsearch mailing lists about that part. After that you can refer
to your analyzer in the Elasticsearch mapping/index config.
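For example, the index settings might then look something like this (the
names "my_sentence_tokenizer" and "my_pos_filter" are hypothetical
placeholders for whatever names your plugin registers):

```
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_linguistic_analyzer": {
          "type": "custom",
          "tokenizer": "my_sentence_tokenizer",
          "filter": ["lowercase", "my_pos_filter"]
        }
      }
    }
  }
}
```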

The Tokenizer and TokenFilters can be implemented, e.g., as Robert Muir
suggested. The sentence handling can be done as a subclass of the segmenting
tokenizer base class. Keep in mind that many tasks can already be done with
existing TokenFilters and/or Tokenizers.
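A rough sketch of such a subclass, assuming SegmentingTokenizerBase from the
analyzers-common module (the class name SentenceTokenizer is made up; this
toy version just emits each sentence as one token, where a real one would
hand the sentence to the linguistic component and emit its output):

```java
import java.text.BreakIterator;
import java.util.Locale;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;

// Emits one token per sentence. A real implementation would call the
// linguistic component in setNextSentence() and emit its tokens one by
// one from incrementWord().
public final class SentenceTokenizer extends SegmentingTokenizerBase {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private int sentenceStart, sentenceEnd;
    private boolean pending;

    public SentenceTokenizer() {
        // The base class feeds us BreakIterator-delimited sentences.
        super(BreakIterator.getSentenceInstance(Locale.ROOT));
    }

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        pending = true;
    }

    @Override
    protected boolean incrementWord() {
        if (!pending) return false;
        pending = false;
        clearAttributes();
        // "buffer" and "offset" are protected fields of the base class.
        termAtt.copyBuffer(buffer, sentenceStart, sentenceEnd - sentenceStart);
        offsetAtt.setOffset(correctOffset(offset + sentenceStart),
                            correctOffset(offset + sentenceEnd));
        return true;
    }
}
```

The base class takes care of buffering large inputs, so you never have to
hold the whole document in memory yourself.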

Lucene has no index-level support for POS tags; they exist only inside the
analysis chain. To get them into the index anyway, you can use a TokenFilter
as the last stage that appends the POS tag to the term (e.g., the term
"Windmill" with pos "subject" could be combined by that last TokenFilter into
a single term "Windmill#subject" and indexed like that). For keeping track of
POS tags during analysis (between the Tokenizer and the TokenFilters), you
may need to define custom attributes.
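Sketched out, that combination of a custom attribute plus a final
concatenating filter could look like this (the names PartOfSpeechAttribute
and PosConcatFilter are made up for illustration; Lucene locates the impl
class by the "<AttributeName>Impl" naming convention):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeReflector;

// Custom attribute carrying the POS tag between stages of the chain.
interface PartOfSpeechAttribute extends Attribute {
    void setPartOfSpeech(String pos);
    String getPartOfSpeech();
}

// Found automatically via the "<AttributeName>Impl" naming convention.
final class PartOfSpeechAttributeImpl extends AttributeImpl
        implements PartOfSpeechAttribute {
    private String pos;
    public void setPartOfSpeech(String pos) { this.pos = pos; }
    public String getPartOfSpeech() { return pos; }
    @Override public void clear() { pos = null; }
    @Override public void copyTo(AttributeImpl target) {
        ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
    }
    @Override public void reflectWith(AttributeReflector reflector) {
        reflector.reflect(PartOfSpeechAttribute.class, "partOfSpeech", pos);
    }
}

// Last filter in the chain: folds the POS tag into the indexed term,
// e.g. term "Windmill" + pos "subject" -> "Windmill#subject".
final class PosConcatFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);

    PosConcatFilter(TokenStream input) { super(input); }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        String pos = posAtt.getPartOfSpeech();
        if (pos != null) {
            termAtt.append('#').append(pos);
        }
        return true;
    }
}
```

An earlier filter (e.g. the one wrapping your linguistic component) would set
the attribute; because all stages share one AttributeSource, the last filter
sees the value without any extra plumbing.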

Check the UIMA analysis module for more information on how to do this.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Christian Becker [mailto:christian.frei...@gmail.com]
> Sent: Monday, May 29, 2017 2:37 PM
> To: general@lucene.apache.org
> Subject: Developing experimental "more advanced" analyzers
> 
> Hi There,
> 
> I'm new to Lucene (in fact I'm interested in Elasticsearch, but in this case
> it's related to Lucene) and I want to make some experiments with some
> enhanced analyzers.
> 
> Indeed I have an external linguistic component which I want to connect to
> Lucene / Elasticsearch. So before I produce a bunch of useless code, I
> want to make sure that I'm going the right way.
> 
> The linguistic component needs at least a whole sentence as Input (at best
> it would be the whole text at once).
> 
> So as far as I can see, I would need to create a custom Analyzer and
> override "createComponents" and "normalize".
> 
> Is that correct or am I on the wrong track?
> 
> Bests
> Chris
