There is a custom XML element that can load a user-defined class for feature generation.
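A minimal sketch of what such a user-defined class might look like, not taken from OpenNLP itself. The `AdaptiveFeatureGenerator` interface declared here is a local stand-in, assumed to mirror the shape of `opennlp.tools.util.featuregen.AdaptiveFeatureGenerator` in 1.6.0, so that the example compiles without OpenNLP on the classpath. The class name `SuffixFeatureGenerator`, the `suf<n>=` feature strings, and the three-character limit are all hypothetical; real morphological analysis would need much more than trailing character n-grams.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in (assumption): mirrors the method shapes of OpenNLP's
// AdaptiveFeatureGenerator. In real use, delete this interface and
// implement opennlp.tools.util.featuregen.AdaptiveFeatureGenerator instead.
interface AdaptiveFeatureGenerator {
    void createFeatures(List<String> features, String[] tokens, int index,
                        String[] previousOutcomes);
    void updateAdaptiveData(String[] tokens, String[] outcomes);
    void clearAdaptiveData();
}

// Hypothetical generator: emits the trailing 1..MAX_SUFFIX characters of the
// current token as a crude approximation of morphological suffixes.
class SuffixFeatureGenerator implements AdaptiveFeatureGenerator {
    private static final int MAX_SUFFIX = 3;

    // No-argument constructor, assumed necessary because the class is
    // instantiated by name rather than by the caller.
    SuffixFeatureGenerator() {}

    @Override
    public void createFeatures(List<String> features, String[] tokens, int index,
                               String[] previousOutcomes) {
        String token = tokens[index];
        // Stop before the suffix would swallow the whole token.
        for (int len = 1; len <= MAX_SUFFIX && len < token.length(); len++) {
            features.add("suf" + len + "=" + token.substring(token.length() - len));
        }
    }

    // This generator keeps no state between sentences or documents.
    @Override
    public void updateAdaptiveData(String[] tokens, String[] outcomes) {}

    @Override
    public void clearAdaptiveData() {}
}

class SuffixFeatureDemo {
    public static void main(String[] args) {
        List<String> features = new ArrayList<>();
        String[] tokens = {"avan", "marangal"};
        // Generate features for the token at index 1.
        new SuffixFeatureGenerator().createFeatures(features, tokens, 1, new String[0]);
        System.out.println(features); // prints [suf1=l, suf2=al, suf3=gal]
    }
}
```

In a real project the class would presumably be public, live in its own package, and keep a public no-argument constructor so that it can be instantiated reflectively; that is an assumption about the loader and worth checking against the manual.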
So you would add an element like this:

<custom class="com.x.y.AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator"/>

https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen

I think we should remove the deprecated training methods so it is no longer
possible to train models which can't be loaded.

Jörn

On Mon, Mar 7, 2016 at 6:45 PM, Cohan Sujay Carlos <co...@aiaioo.com> wrote:

> Dear Rodrigo,
>
> Thank you for the informative reply.
>
> I just wanted to say I feel there is a use-case that the new constructor
> still does not support. Let me explain with an example.
>
> Let's first take the example of brown-feature.xml, which is defined as ...
>
> <generators>
>   <cache>
>     <generators>
>       <window prevLength = "2" nextLength = "2">
>         <token/>
>       </window>
>       <window prevLength = "2" nextLength = "2">
>         <brownclustertoken dict="brownBllipClusters" />
>       </window>
>     </generators>
>   </cache>
> </generators>
>
> ... In this feature generator, I believe "window" maps to the
> WindowFeatureGenerator
> <https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html>
> and "token" maps to TokenFeatureGenerator
> <https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/TokenFeatureGenerator.html>.
>
> It's clear that we can create new feature generators that are combinations
> of existing feature generators.
>
> However, let's say I have a task / language where none of the existing
> feature generators or combinations work very well.
>
> Say, for example, that I want to create a new feature generator that pulls
> out morphemes from agglutinative South Indian languages ... let's call it
> "AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator".
>
> It's not clear how one could create XML tags for this feature generator
> using the new constructor.
>
> The same thing is easy to do programmatically using the old constructors ->
> I would just extend the AdaptiveFeatureGenerator
> <https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html>.
>
> So, I was wondering ... are we giving up some API flexibility and
> simplicity by removing the constructors that enable me to use subclasses of
> AdaptiveFeatureGenerator
> <https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html>
> while there is no easy way to create something like an
> AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator and use
> it as a feature generator in the NameFinderME using the new constructor's
> XML specification?
>
> Cohan Sujay Carlos
> Aiaioo Labs, +91-77605-80015, http://www.aiaioo.com
>
> On Mon, Mar 7, 2016 at 4:37 PM, Rodrigo Agerri <rage...@apache.org> wrote:
>
> > Hi,
> >
> > You can do all those tasks by using the create method in the
> > TokenNameFinderFactory:
> >
> > http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/TokenNameFinderFactory.java?revision=1712553&view=markup#l100
> >
> > For that you need to:
> >
> > 1. Provide the name of the factory class you are using; it could be
> > the same factory class: TokenNameFinderFactory.class.getName()
> > 2. Create an XML descriptor and pass it as a byte[] array
> > 3. Load the resources (e.g., clusters) in a resources map consisting
> > of the id of the resource and the serializer.
> > 4. Specify the sequenceCodec: BIO or BILOU.
> >
> > The Namefinder documentation was already updated:
> >
> > http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml?view=markup
> >
> > There is sample code to do that in the CLI class:
> >
> > http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderTrainerTool.java?revision=1674262&view=markup
> >
> > and to run it from the CLI:
> >
> > 1. Create an XML feature descriptor, e.g., brown-feature.xml
> >
> > <generators>
> >   <cache>
> >     <generators>
> >       <window prevLength = "2" nextLength = "2">
> >         <token/>
> >       </window>
> >       <window prevLength = "2" nextLength = "2">
> >         <brownclustertoken dict="brownBllipClusters" />
> >       </window>
> >     </generators>
> >   </cache>
> > </generators>
> >
> > 2. Put your clustering lexicon(s) in a directory, e.g., clusters
> > 3. Train: bin/opennlp TokenNameFinderTrainer -featuregen brown-feature.xml
> > -resources clusters/ -params lang/ml/PerceptronTrainerParams.txt -lang
> > en -model brown.bin -data
> > ~/experiments/nerc/opennlp/en/conll03/en-testb.opennlp -encoding UTF-8
> >
> > If you open the brown.bin model you will see the cluster lexicon
> > serialized inside the model.
> >
> > You can now use it like any other model; the TokenNameFinderFactory
> > will read again all the required resources when loading the model in
> > the NameFinderME class.
> >
> > HTH,
> >
> > R
> >
> > On Mon, Feb 15, 2016 at 7:59 AM, Cohan Sujay Carlos <co...@aiaioo.com>
> > wrote:
> > > Hi,
> > >
> > > I noticed that in the OpenNLP SVN 'trunk', the formerly deprecated
> > > constructors for the class *NameFinderME*:
> > >
> > > *public NameFinderME(TokenNameFinderModel model, AdaptiveFeatureGenerator
> > > generator, int beamSize, SequenceValidator<String> sequenceValidator);*
> > >
> > > and
> > >
> > > *public NameFinderME(TokenNameFinderModel model, AdaptiveFeatureGenerator
> > > generator, int beamSize)*
> > >
> > > have been removed, along with
> > >
> > > *public NameFinderME(TokenNameFinderModel model, int beamSize)*
> > >
> > > The deprecation comments said:
> > >
> > > @deprecated the beam size is now configured during training time in the
> > > trainer parameter file via beamSearch.beamSize
> > >
> > > and
> > >
> > > @deprecated Use {@link #NameFinderME(TokenNameFinderModel)} instead and use
> > > the {@link TokenNameFinderFactory} to configure it.
> > >
> > > I wanted to point out a few potential problems:
> > >
> > > 1. The corresponding train methods have not been removed. So, it is
> > > possible to train a NameFinderME using a *custom* AdaptiveFeatureGenerator
> > > class to do feature engineering, but once a model has been so trained,
> > > there is no way to load and use the stored model with the same
> > > AdaptiveFeatureGenerator.
> > >
> > > 2. There is still no documentation on the TokenNameFinderFactory which is
> > > supposed to replace the constructor with the AdaptiveFeatureGenerator.
> > >
> > > 3. I went over the code of TokenNameFinderFactory and a few places where
> > > it is used and it seemed to be designed for working with an XML
> > > specification of feature combinations. I have also in the references
> > > included a mailing list conversation that says this class should be used
> > > with an XML file.
> > > However, it turns out that custom feature sets for
> > > sequential classification are often important, so might we be dropping
> > > valuable feature engineering support?
> > >
> > > Finally, in light of the above, could we keep the deprecated constructors
> > > around until the alternative constructor (using TokenNameFinderFactory)
> > > enters into production, and examples and documentation for it become widely
> > > available?
> > >
> > > References:
> > >
> > > On the TokenNameFinderFactory using XML:
> > >
> > > https://mail-archives.apache.org/mod_mbox/opennlp-dev/201410.mbox/%3CCAKvDkVDfAx5BMvwVOrbvpZm7xV9erRQzrzbCDpfd+Cq6m=x...@mail.gmail.com%3E
> > >
> > > Relevant JIRA issues:
> > > https://issues.apache.org/jira/browse/OPENNLP-718
> > > https://issues.apache.org/jira/browse/OPENNLP-717
> > >
> > > Thank you,
> > >
> > > Cohan Sujay Carlos