Dear Rodrigo,
Thank you for the informative reply.
I just wanted to say I feel there is a use-case that the new constructor
still does not support. Let me explain with an example.
Let's first take the example of brown-feature.xml, which is defined as ...
<generators>
<cache>
<generators>
<window prevLength = "2" nextLength = "2">
<token/>
</window>
<window prevLength = "2" nextLength = "2">
<brownclustertoken dict="brownBllipClusters" />
</window>
</generators>
</cache>
</generators>
... In this feature generator, I believe "window" maps to the
WindowFeatureGenerator
<https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html>
and "token" maps to the TokenFeatureGenerator
<https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/TokenFeatureGenerator.html>.
It's clear that we can create new feature generators that are combinations
of existing feature generators.
However, let's say I have a task / language where none of the existing
feature generators or combinations work very well.
Say, for example, that I want to create a new feature generator that pulls
out morphemes from agglutinative South Indian languages ... let's call it
"AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator".
It's not clear how one could create XML tags for this feature generator
using the new constructor.
The same thing is easy to do programmatically with the old constructors:
I would just implement the AdaptiveFeatureGenerator interface
<https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html>.
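As a rough sketch of what I mean (not working OpenNLP code): the interface
below is a local stand-in that mirrors the three methods of
AdaptiveFeatureGenerator so that the example compiles on its own, and the
class name, the "suf" feature prefix and the maxSuffixLength parameter are
all made up for illustration. Plain character suffixes are only a crude
proxy for real morphological analysis.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in mirroring opennlp.tools.util.featuregen.AdaptiveFeatureGenerator,
// declared here only so the sketch compiles without OpenNLP on the classpath.
interface FeatureGenerator {
    void createFeatures(List<String> features, String[] tokens, int index,
                        String[] previousOutcomes);
    void updateAdaptiveData(String[] tokens, String[] outcomes);
    void clearAdaptiveData();
}

// Hypothetical generator: emits the trailing 1..maxSuffixLength characters
// of the current token as features (a crude approximation of suffix morphemes).
class MorphologicalSuffixFeatureGenerator implements FeatureGenerator {
    private final int maxSuffixLength;

    MorphologicalSuffixFeatureGenerator(int maxSuffixLength) {
        this.maxSuffixLength = maxSuffixLength;
    }

    @Override
    public void createFeatures(List<String> features, String[] tokens, int index,
                               String[] previousOutcomes) {
        String token = tokens[index];
        // Never emit the whole token as a "suffix".
        int limit = Math.min(maxSuffixLength, token.length() - 1);
        for (int len = 1; len <= limit; len++) {
            features.add("suf" + len + "=" + token.substring(token.length() - len));
        }
    }

    // This generator keeps no state across sentences or documents.
    @Override
    public void updateAdaptiveData(String[] tokens, String[] outcomes) { }

    @Override
    public void clearAdaptiveData() { }
}
```

With the removed constructors, an object like this (implementing the real
interface) could be passed straight to NameFinderME together with the model.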
So, I was wondering: are we giving up some API flexibility and simplicity
by removing the constructors that let me use custom implementations of
AdaptiveFeatureGenerator
<https://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html>,
when there is no easy way to create something like an
AgglutinativeSouthIndianLanguageMorphologicalSuffixFeatureGenerator and use
it as a feature generator in NameFinderME through the new constructor's
XML specification?
Cohan Sujay Carlos
Aiaioo Labs, +91-77605-80015, http://www.aiaioo.com
On Mon, Mar 7, 2016 at 4:37 PM, Rodrigo Agerri <[email protected]> wrote:
> Hi,
>
> You can do all those tasks by using the create method in the
> TokenNameFinderFactory:
>
>
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/namefind/TokenNameFinderFactory.java?revision=1712553&view=markup#l100
>
> For that you need to:
>
> 1. Provide the name of the factory class you are using, it could be
> the same factory class: TokenNameFinderFactory.class.getName()
> 2. Create an XML descriptor and pass it as a byte[] array
> 3. Load the resources (e.g., clusters) in a resources map consisting
> of the id of the resource and the serializer.
> 4. The sequenceCodec: BIO or BILOU.
>
> The Namefinder documentation has already been updated:
>
>
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml?view=markup
>
> There is sample code to do that in the CLI class:
>
>
> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderTrainerTool.java?revision=1674262&view=markup
>
> and to run it from the CLI:
>
> 1. Create an XML feature descriptor, e.g., brown-feature.xml
>
> <generators>
> <cache>
> <generators>
> <window prevLength = "2" nextLength = "2">
> <token/>
> </window>
> <window prevLength = "2" nextLength = "2">
> <brownclustertoken dict="brownBllipClusters" />
> </window>
> </generators>
> </cache>
> </generators>
>
> 2. Put your clustering lexicon(s) in a directory, e.g., clusters
> 3. Train: bin/opennlp TokenNameFinderTrainer -featuregen brown.xml
> -resources clusters/ -params lang/ml/PerceptronTrainerParams.txt -lang
> en -model brown.bin -data
> ~/experiments/nerc/opennlp/en/conll03/en-testb.opennlp -encoding UTF-8
>
> If you open the brown.bin model you will see the cluster lexicon
> serialized inside the model.
>
> You can now use it like any other model, the TokenNameFinderFactory
> will read again all the required resources when loading the model in
> the TokenNameFinderME class.
>
> HTH,
>
> R
>
> On Mon, Feb 15, 2016 at 7:59 AM, Cohan Sujay Carlos <[email protected]>
> wrote:
> > Hi,
> >
> > I noticed that in the OpenNLP SVN 'trunk', the formerly deprecated
> > constructors for the class *NameFinderME*:
> >
> > *public NameFinderME(TokenNameFinderModel model, AdaptiveFeatureGenerator
> > generator, int beamSize, SequenceValidator<String> sequenceValidator);*
> >
> > and
> >
> >
> > *public NameFinderME(TokenNameFinderModel model, AdaptiveFeatureGenerator
> > generator, int beamSize)*
> >
> > have been removed, along with
> >
> > *public NameFinderME(TokenNameFinderModel model, int beamSize)*
> >
> > The deprecation comments said:
> >
> > @deprecated the beam size is now configured during training time in the
> > trainer parameter file via beamSearch.beamSize
> >
> > and
> >
> > @deprecated Use {@link #NameFinderME(TokenNameFinderModel)} instead and
> > use the {@link TokenNameFinderFactory} to configure it.
> >
> > I wanted to point out a few potential problems:
> >
> > 1. The corresponding train methods have not been removed. So, it is
> > possible to train a NameFinderME using a *custom* AdaptiveFeatureGenerator
> > class to do feature engineering, but once a model has been so trained,
> > there is no way to load and use the stored model with the same
> > AdaptiveFeatureGenerator.
> >
> > 2. There is still no documentation on the TokenNameFinderFactory, which
> > is supposed to replace the constructor with the AdaptiveFeatureGenerator.
> >
> > 3. I went over the code of TokenNameFinderFactory and a few places where
> > it is used and it seemed to be designed for working with an XML
> > specification of feature combinations. I have also in the references
> > included a mailing list conversation that says this class should be used
> > with an XML file. However, it turns out that custom feature sets for
> > sequential classification are often important, so might we be dropping
> > valuable feature engineering support?
> >
> > Finally, in light of the above, could we keep the deprecated constructors
> > around until the alternative constructor (using TokenNameFinderFactory)
> > enters into production, and examples and documentation for it become
> > widely available?
> >
> > References:
> >
> > On the TokenNameFinderFactory using XML:
> >
> https://mail-archives.apache.org/mod_mbox/opennlp-dev/201410.mbox/%3CCAKvDkVDfAx5BMvwVOrbvpZm7xV9erRQzrzbCDpfd+Cq6m=x...@mail.gmail.com%3E
> >
> > Relevant JIRA issues:
> > https://issues.apache.org/jira/browse/OPENNLP-718
> > https://issues.apache.org/jira/browse/OPENNLP-717
> >
> > Thank you,
> >
> > Cohan Sujay Carlos
>