Hi again,

All these problems are solved as per this issue:
https://issues.apache.org/jira/browse/OPENNLP-717

Only one issue remains: the requirement to add the -factory parameter for the
-featuregen parameter to work, and its backing off to the default features
without warning if the -factory param is not used.

Thanks,

Rodrigo

On Sat, Oct 4, 2014 at 12:53 AM, Rodrigo Agerri <[email protected]> wrote:
> Hi,
>
> As a follow-up, it turns out that currently we can provide a feature
> generator via the -featuregen parameter only if you also provide a
> subclass via the -factory parameter. I do not know if that is intended.
> Also, I have noticed very weird behaviour: I pass several descriptors via
> the CLI (starting with token features only, then adding tokenclass, etc.)
> and it all goes well until I add either the Prefix or the
> SuffixFeatureGenerator, at which point the performance drops alarmingly to
> 49.65 F1 when prefix and suffix are added to the default descriptor:
>
> bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
> opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
> -params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
> -data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train
>
> I get this behaviour with all four CoNLL 2003 and CoNLL 2002 datasets.
>
> <generators>
>   <cache>
>     <generators>
>       <window prevLength="2" nextLength="2">
>         <tokenclass/>
>       </window>
>       <window prevLength="2" nextLength="2">
>         <token/>
>       </window>
>       <definition/>
>       <prevmap/>
>       <bigram/>
>       <sentence begin="true" end="false"/>
>       <prefix/>
>       <suffix/>
>     </generators>
>   </cache>
> </generators>
>
> Cheers,
>
> Rodrigo
>
> On Fri, Oct 3, 2014 at 5:55 PM, Rodrigo Agerri <[email protected]> wrote:
>> Hi Jörn,
>>
>> On Fri, Oct 3, 2014 at 12:40 PM, Jörn Kottmann <[email protected]> wrote:
>>>
>>> There are two things you need to do:
>>>
>>> 1. Implement the feature generators
>>>    - Implement AdaptiveFeatureGenerator, or extend CustomFeatureGenerator
>>>      if you need to pass parameters to it
>>
>> OK, let's say I have this descriptor:
>>
>> <generators>
>>   <cache>
>>     <generators>
>>       <window prevLength="2" nextLength="2">
>>         <token/>
>>       </window>
>>       <custom class="es.ehu.si.ixa.pipe.nerc.features.Prefix34FeatureGenerator"/>
>>     </generators>
>>   </cache>
>> </generators>
>>
>> Now, if I understand the implementation (and your comments) correctly:
>>
>> 1. I should just create a Prefix34FeatureGenerator class extending
>>    FeatureGeneratorAdapter.
>> 2. If I wanted to pass parameters, e.g. descriptor attributes, then I
>>    should extend CustomFeatureGenerator.
>> 3. If I load such a descriptor as the argument of -featuregen on the CLI,
>>    the CLI should complain if the class is not on the classpath, I guess.
>>    If it is on the classpath, then it should use the custom generator.
>> 4. As it is now, no matter what value you pass to -featuregen, it always
>>    trains the default features. It does not complain even if the custom
>>    feature generator is not well-formed. Even if I only pass the token
>>    features, it still loads the default generator. With version 1.5.3 it
>>    works fine, though. I am looking into it, but any hints are welcome :)
>> 5. When I do this programmatically, e.g. load the feature generator
>>    descriptor into an extension of the TokenNameFinderFactory, it seems
>>    to load the custom generators; the GeneratorFactory loads the
>>    descriptor I pass, e.g. if only tokens, then it trains successfully
>>    with only tokens. However, if I pass a custom generator, it does not
>>    complain, it trains, and the performance drops to 40 F1. For the
>>    record, I build the descriptor programmatically like this:
>>
>> Element prefixFeature = new Element("custom");
>> prefixFeature.setAttribute("class", Prefix34FeatureGenerator.class.getName());
>> generators.addContent(prefixFeature);
>>
>> and then the GeneratorFactory does get it without errors.
>>
>>> 2. Implement support to load and serialize the data they need
>>>    - This class should implement SerializableArtifact
>>>    - And if you want to load it, the feature generator should implement
>>>      ArtifactToSerializerMapper, which tells the loader which class to
>>>      use to read the data file
>>
>> This is only for the clustering feature resources and such, I guess.
>>
>>> The above is the procedure you should use if you want to have a real
>>> custom feature generator which is not part of the OpenNLP Tools jar.
>>
>> Yes, what I do is include OpenNLP as a Maven dependency in an uber jar,
>> e.g. with all classes inside, including OpenNLP and my custom feature
>> generators. The classpath should be OK in this case, but I still cannot
>> make them work.
>>
>>>
>>>> 6. *Some* of the new features work. If an element name in the
>>>>    descriptor does not match in the GeneratorFactory, then
>>>>    TokenNameFinderFactory.createFeatureGenerators() returns null, and
>>>>    TokenNameFinderFactory.createContextGenerator() automatically stops
>>>>    the feature creation and falls back to
>>>>    NameFinderME.createFeatureGenerator().
>>>>    Is this the desired behaviour? Perhaps we could add a log somewhere
>>>>    to report the backoff to the default features if a descriptor
>>>>    element does not match?
>>>
>>> That sounds really bad. If there is a problem in the mapping, it should
>>> fail hard and throw an exception. The user should be forced to decide
>>> for himself what to do: either fix his descriptor or use the defaults.
>>
>> I can open an issue and look into it.
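
[Editor's note: the fail-hard behaviour Jörn asks for above could look roughly like the following minimal sketch. This is not OpenNLP's actual code; the class name, method name, and the string registry are all hypothetical stand-ins, kept self-contained so the idea (throw on an unknown descriptor element instead of silently backing off to defaults) is runnable as-is.]

```java
import java.util.Map;

// Sketch of a strict lookup: an unknown descriptor element name throws
// instead of returning null and triggering a silent backoff to the
// default feature generator. All names here are hypothetical.
public class StrictGeneratorLookup {

    // Stand-in registry mapping descriptor element names to generator classes.
    private static final Map<String, String> REGISTRY = Map.of(
            "token", "TokenFeatureGenerator",
            "tokenclass", "TokenClassFeatureGenerator",
            "prefix", "PrefixFeatureGenerator",
            "suffix", "SuffixFeatureGenerator");

    public static String create(String elementName) {
        String generator = REGISTRY.get(elementName);
        if (generator == null) {
            // Fail hard: force the user to fix the descriptor or explicitly
            // choose the defaults, rather than backing off without warning.
            throw new IllegalArgumentException(
                    "Unknown feature generator element: " + elementName);
        }
        return generator;
    }

    public static void main(String[] args) {
        System.out.println(create("prefix")); // prints PrefixFeatureGenerator
        try {
            create("prefx"); // a typo in the descriptor
        } catch (IllegalArgumentException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```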
>>
>>> The idea is that we always use the XML descriptor to define the feature
>>> generation; that way we can have different configurations without
>>> changing the OpenNLP code itself, and don't need special user code to
>>> integrate a customized name finder model. If a model makes use of
>>> external classes, these of course need to be on the classpath, since we
>>> can't ship them as part of the model.
>>
>> OK, but I think what I did above is what you meant, is it not?
>>
>> Thanks,
>>
>> Rodrigo
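
[Editor's note: for readers following along, here is a minimal sketch of what the Prefix34FeatureGenerator discussed in the thread might look like. To keep the snippet self-contained, it declares a tiny stand-in interface with the same shape as OpenNLP's feature generator callback; in real code you would implement AdaptiveFeatureGenerator (or extend CustomFeatureGenerator) from opennlp-tools, and the exact signatures come from that library.]

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for OpenNLP's AdaptiveFeatureGenerator, declared locally so the
// sketch compiles on its own. The real interface lives in opennlp-tools.
interface FeatureGenerator {
    void createFeatures(List<String> features, String[] tokens, int index);
}

// Sketch of a prefix feature generator: emits prefixes of length 3 and 4
// for the current token, when the token is long enough.
public class Prefix34Sketch implements FeatureGenerator {

    @Override
    public void createFeatures(List<String> features, String[] tokens, int index) {
        String token = tokens[index];
        for (int length = 3; length <= 4 && length <= token.length(); length++) {
            features.add("pre" + length + "=" + token.substring(0, length));
        }
    }

    public static void main(String[] args) {
        List<String> features = new ArrayList<>();
        new Prefix34Sketch().createFeatures(features,
                new String[] {"Amsterdam", "is", "mooi"}, 0);
        System.out.println(features); // [pre3=Ams, pre4=Amst]
    }
}
```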

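[Editor's note: the JDOM snippet quoted in the thread (new Element("custom"), setAttribute, addContent) can also be reproduced with the JDK's built-in DOM API if you want to avoid the JDOM dependency. The sketch below is a hedged illustration, not code from the thread: it builds the same <custom class="..."/> element inside a <generators> descriptor, using the class name mentioned in the mail.]

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Build a minimal feature generator descriptor with the JDK's DOM API,
// mirroring the JDOM snippet from the thread.
public class DescriptorBuilder {

    public static String buildDescriptor() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        Element generators = doc.createElement("generators");
        doc.appendChild(generators);

        // Equivalent of: new Element("custom").setAttribute("class", ...)
        Element prefixFeature = doc.createElement("custom");
        prefixFeature.setAttribute("class",
                "es.ehu.si.ixa.pipe.nerc.features.Prefix34FeatureGenerator");
        generators.appendChild(prefixFeature);

        // Serialize the DOM tree to a string.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildDescriptor());
    }
}
```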