Hello, now I understand it much better. Previously it was only possible to provide the XML descriptor as bytes or an instance of a Feature Generator to the train method.
The method which accepted the XML descriptor then instantiated it and called the train method which takes a Feature Generator. After the training was done, it updated the model so that it included the XML descriptor artifact.

With the recent change, the train method that took the Feature Generator now accepts the TokenNameFinderFactory instead. Since then, the XML descriptor is no longer written into the model.

I see two ways to fix this:
- The way you suggested, by extracting the XML descriptor from the TokenNameFinderFactory
- Or by returning the XML descriptor as part of the TokenNameFinder.getResources() method.

The second option is probably better, because it gives more control to the TokenNameFinderFactory. A different factory implementation could then change the format of the feature generator description. It might also simplify the TokenNameFinderModel implementation, since most of the code dealing with feature generation could be removed.

Any opinions?

Jörn

On Mon, 2014-10-06 at 17:57 +0200, Rodrigo Agerri wrote:
> Hi,
>
> On Mon, Oct 6, 2014 at 5:41 PM, Jörn Kottmann <[email protected]> wrote:
> >
> > Isn't that how it is implemented today? The feature generators can't be
> > shared and therefore we have the createFeatureGenerators method in the
> > TokenNameFinderFactory which creates a new feature generator every time
> > one is needed.
> > That one tries to read the xml descriptor from the model and creates the
> > feature generators.
>
> Yes, but with one exception: it all goes well until it arrives at line
> 361 of NameFinderME:
>
> return new TokenNameFinderModel(languageCode, nameFinderModel,
>     beamSize, null, factory.getResources(), manifestInfoEntries,
>     factory.getSequenceCodec());
>
> that "null" parameter is the featureGenerator. The init() method in
> the TokenNameFinderModel class gets that null and returns the default
> feature generator.
>
> what is needed is to pass the featureGenerator created by the
> TokenNameFinder.createContext() as a parameter. That is why I added a
> getter in the TokenNameFinderFactory for the field private byte[]
> featureGeneratorBytes. I just added it, and to create the
> TokenNameFinderModel above in NameFinderME I say:
>
> return new TokenNameFinderModel(languageCode, nameFinderModel,
>     beamSize, factory.getFeatureGenerator(), factory.getResources(),
>     manifestInfoEntries, factory.getSequenceCodec());
>
> and it all works as expected.
>
> > I will try to reproduce the bug you see.
> >
> > How can I do that?
> >
> > First train a model with this command:
> >
> > bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
> > opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
> > -params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
> > -data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train
> >
> > and this feature generator config:
> > <generators>
> >   <cache>
> >     <generators>
> >       <window prevLength = "2" nextLength = "2">
> >         <tokenclass/>
> >       </window>
> >       <window prevLength = "2" nextLength = "2">
> >         <token/>
> >       </window>
> >       <definition/>
> >       <prevmap/>
> >       <bigram/>
> >       <sentence begin="true" end="false"/>
> >       <prefix/>
> >       <suffix/>
> >     </generators>
> >   </cache>
> > </generators>
> >
> > Did you use the command line tool for the evaluation too?
> > Maybe you can post the command for that.
>
> Yes, and then try to train with the default featureGenerator in the
> lang/en/namefind directory.
>
> bin/opennlp TokenNameFinderEvaluator -model test.bin -data
> ~/experiments/nerc/opennlp/data/nl/conll2002/opennlp-nl.testb
>
> Cheers,
>
> Rodrigo
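
P.S. To make the second option a bit more concrete, here is a rough, self-contained sketch of the idea: the factory hands out the XML descriptor bytes as one more entry in its resources map, and the model simply stores whatever it is given. This is not the OpenNLP code. SketchFactory, SketchModel, train() and the DESCRIPTOR_KEY value are hypothetical stand-ins invented for illustration, and the real class and artifact names may well differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only; none of these are the real OpenNLP classes.
public class DescriptorInResourcesSketch {

    // Assumed artifact key, made up for this sketch.
    public static final String DESCRIPTOR_KEY = "sketch.featuregen.descriptor";

    // Plays the role of the factory: it owns the descriptor bytes and
    // returns them together with the other resources.
    public static class SketchFactory {
        private final byte[] featureGeneratorBytes;
        private final Map<String, Object> resources;

        public SketchFactory(byte[] featureGeneratorBytes, Map<String, Object> resources) {
            this.featureGeneratorBytes = featureGeneratorBytes;
            this.resources = resources;
        }

        public Map<String, Object> getResources() {
            // Copy so callers cannot modify the factory's own map.
            Map<String, Object> all = new HashMap<>(resources);
            if (featureGeneratorBytes != null) {
                all.put(DESCRIPTOR_KEY, featureGeneratorBytes);
            }
            return all;
        }
    }

    // Plays the role of the model: it keeps the artifacts it receives and
    // needs no separate feature generator parameter.
    public static class SketchModel {
        private final Map<String, Object> artifacts;

        public SketchModel(Map<String, Object> resources) {
            this.artifacts = new HashMap<>(resources);
        }

        public byte[] getDescriptor() {
            return (byte[]) artifacts.get(DESCRIPTOR_KEY);
        }
    }

    // Training itself is omitted; only the hand-over of resources is shown.
    public static SketchModel train(SketchFactory factory) {
        return new SketchModel(factory.getResources());
    }

    public static void main(String[] args) {
        byte[] xml = "<generators><bigram/></generators>".getBytes(StandardCharsets.UTF_8);
        SketchModel model = train(new SketchFactory(xml, new HashMap<>()));
        System.out.println(new String(model.getDescriptor(), StandardCharsets.UTF_8));
    }
}
```

With this shape the model constructor would not need a feature generator argument at all, and a factory with no descriptor just leaves the entry out, so the model can still fall back to its default.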
