Re: [opennlp-dev] TokenNameFinderFactory new features and extension

Jörn Kottmann Mon, 06 Oct 2014 14:20:07 -0700

Hello,

now I understand it much better. Previously it was
only possible to provide the XML descriptor as bytes or an instance
of a Feature Generator to the train method.


The method which accepted the XML descriptor instantiated it and
called the train method which takes a Feature Generator. After
training was done, it updated the model so that it included the XML
descriptor artifact.

With the recent change, the train method that took the Feature
Generator now accepts the TokenNameFinderFactory instead. Since then,
the XML descriptor is no longer written into the model.

I see two ways to fix this:
- The way you suggested, by extracting the XML descriptor from the
TokenNameFinderFactory
- Or by returning the XML descriptor as part of the
TokenNameFinderFactory.getResources() method.

The second option is probably better, because it gives more control to
the TokenNameFinderFactory. It would be possible to change the format of
the feature generator description with a different factory
implementation. And it might simplify the TokenNameFinderModel
implementation, since most code dealing with the feature generation
could be removed.
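
A minimal sketch of the second option, to make the idea concrete. It
does not use the real OpenNLP classes: SketchFactory, its constructor
and the artifact key "generator.featuregen" are assumptions made only
for illustration. The point is that the factory itself puts the
descriptor bytes into the map returned by getResources(), so the
caller that builds the model needs no separate feature generator
parameter.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the name finder factory; not the OpenNLP API.
public class DescriptorInResourcesSketch {

    // Assumed artifact key under which the descriptor would be stored.
    static final String DESCRIPTOR_KEY = "generator.featuregen";

    static class SketchFactory {
        private final byte[] featureGeneratorBytes;
        private final Map<String, Object> resources;

        SketchFactory(byte[] featureGeneratorBytes, Map<String, Object> resources) {
            this.featureGeneratorBytes = featureGeneratorBytes;
            this.resources = new HashMap<>(resources);
        }

        // Returns the external resources plus the descriptor, if one was given.
        Map<String, Object> getResources() {
            Map<String, Object> all = new HashMap<>(resources);
            if (featureGeneratorBytes != null) {
                all.put(DESCRIPTOR_KEY, featureGeneratorBytes.clone());
            }
            return all;
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<generators><bigram/></generators>"
                .getBytes(StandardCharsets.UTF_8);
        SketchFactory factory = new SketchFactory(xml, new HashMap<>());

        // The model would be built from this map alone.
        Map<String, Object> artifacts = factory.getResources();
        System.out.println("descriptor stored: "
                + artifacts.containsKey(DESCRIPTOR_KEY));
    }
}
```

A factory with a different descriptor format would only have to
override getResources() and store its own artifact there.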

Any opinions?

Jörn

On Mon, 2014-10-06 at 17:57 +0200, Rodrigo Agerri wrote:
> Hi,
> 
> On Mon, Oct 6, 2014 at 5:41 PM, Jörn Kottmann <[email protected]> wrote:
> >
> > Isn't that how it is implemented today? The feature generators can't be
> > shared
> > and therefore we have the createFeatureGenerators method in the
> > TokenNameFinderFactory
> > which creates a new feature generator every time one is needed.
> > That one tries to read the xml descriptor from the model and creates the
> > feature generators.
> 
> Yes, but with one exception: it all goes well until it reaches line
> 361 of NameFinderME:
> 
> return new TokenNameFinderModel(languageCode, nameFinderModel,
> beamSize, null, factory.getResources(), manifestInfoEntries,
> factory.getSequenceCodec());
> 
> That "null" parameter is the featureGenerator. The init() method in
> the TokenNameFinderModel class gets that null and returns the default
> feature generator.
> 
> What is needed is to pass the featureGenerator created by the
> TokenNameFinder.createContext() as a parameter. That is why I added a
> getter in the TokenNameFinderFactory for the field private byte[]
> featureGeneratorBytes. With that getter in place, I create the
> TokenNameFinderModel above in NameFinderME like this:
> 
>  return new TokenNameFinderModel(languageCode, nameFinderModel,
> beamSize, factory.getFeatureGenerator(), factory.getResources(),
> manifestInfoEntries, factory.getSequenceCodec());
> 
> and it all works as expected.
> 
> > I will try to reproduce the bug you see.
> >
> > How can I do that?
> >
> > First train a model with this command:
> 
> > bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
> > opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
> > -params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
> > -data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train
> >
> > and this feature generator config:
> > <generators>
> >   <cache>
> >     <generators>
> >       <window prevLength = "2" nextLength = "2">
> >         <tokenclass/>
> >       </window>
> >       <window prevLength = "2" nextLength = "2">
> >         <token/>
> >       </window>
> >       <definition/>
> >       <prevmap/>
> >       <bigram/>
> >       <sentence begin="true" end="false"/>
> >       <prefix/>
> >       <suffix/>
> >     </generators>
> >   </cache>
> > </generators>
> >
> > Did you use the command line tool for the evaluation too?
> > Maybe you can post the command for that.
> 
> Yes, and then try to train with the default featureGenerator in the
> lang/en/namefind directory.
> 
> bin/opennlp TokenNameFinderEvaluator -model test.bin -data
> ~/experiments/nerc/opennlp/data/nl/conll2002/opennlp-nl.testb
> 
> Cheers,
> 
> Rodrigo
