On 10/03/2014 11:58 AM, Rodrigo Agerri wrote:
I have implemented a number of new features for the name finder. These
include Brown clusters features (duplicated per Brown path for each
feature activated involving a token) and Clark cluster features
(similar to the WordClusterFeatureGenerator currently available) among
other local extra features which interact well with the clustering
ones.

I think it will be nice to include them before the new release. I will
open issues about each of them. What do you think?

Yes, please open issues for them. It would be really nice to receive them as a contribution.

There are two things you need to do:
1. Implement the feature generators
- Implement AdaptiveFeatureGenerator, or extend CustomFeatureGenerator if you need to pass parameters to it

2. Implement support for loading and serializing the data they need
- The class holding that data should implement SerializableArtifact
- And if you want to load and use it, the feature generator should implement ArtifactToSerializerMapper, which tells
the loader which class to use to read the data file

The above is the procedure you should use if you want to have a real custom feature generator which is not part of
the OpenNLP Tools jar.
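
To make step 1 a bit more concrete, here is a minimal sketch of what a cluster-based generator could look like. It is only an illustration: FeatureGeneratorSketch is a simplified stand-in declared locally so the example compiles without the OpenNLP jar (the real AdaptiveFeatureGenerator declares more than this one method), and the class name, the token-to-cluster map and the cluster id are all made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified stand-in for OpenNLP's AdaptiveFeatureGenerator so that this sketch
// compiles without the OpenNLP jar. The real interface declares more than this.
interface FeatureGeneratorSketch {
    void createFeatures(List<String> features, String[] tokens, int index,
                        String[] previousOutcomes);
}

// Hypothetical generator: looks the current token up in a token-to-cluster map
// and emits a single "cluster=<id>" feature when the token is known.
class ClusterFeatureGeneratorSketch implements FeatureGeneratorSketch {

    private final Map<String, String> tokenToCluster;

    ClusterFeatureGeneratorSketch(Map<String, String> tokenToCluster) {
        this.tokenToCluster = tokenToCluster;
    }

    @Override
    public void createFeatures(List<String> features, String[] tokens, int index,
                               String[] previousOutcomes) {
        // Lower-case the token so the lookup does not depend on capitalization.
        String cluster = tokenToCluster.get(tokens[index].toLowerCase(Locale.ROOT));
        if (cluster != null) {
            features.add("cluster=" + cluster);
        }
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("madrid", "17"); // made-up cluster id, for illustration only

        List<String> features = new ArrayList<>();
        new ClusterFeatureGeneratorSketch(lexicon)
            .createFeatures(features, new String[] {"Madrid", "is", "big"}, 0, new String[0]);
        System.out.println(features); // prints [cluster=17]
    }
}
```

In a real generator the map would be the data loaded through the serialization support from step 2 rather than being built by hand.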

When you contribute it, things are slightly different. You should add an XmlFeatureGeneratorFactory inside the GeneratorFactory class. This factory creates the feature generator based on a defined XML element inside the descriptor.

6. *Some* of the new features work. If an Element name in the
descriptor does not match in the GeneratorFactory, then the
TokenNameFinderFactory.createFeatureGenerators() gives a null and the
TokenNameFinderFactory.createContextGenerator() automatically stops
the feature creation and goes for the
NameFinderME.createFeatureGenerator().
Is this the desired behaviour? Perhaps we could add a log somewhere?
To inform of the backoff to the default features if one descriptor
element does not match?

That sounds really bad. If there is a problem in the mapping, it should fail hard and throw an exception. The user should be forced to decide what to do: either fix the descriptor
or use the defaults.
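
Roughly, the behaviour I have in mind looks like the sketch below. This is not the GeneratorFactory code: StrictGeneratorRegistry, its methods, the placeholder generator and the choice of IllegalArgumentException are assumptions made for the example. The point is only that an element name without a registered factory raises an exception instead of yielding a null that later triggers the default features.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical registry, not the OpenNLP GeneratorFactory: maps descriptor element
// names to factories and refuses to continue when a name is unknown.
class StrictGeneratorRegistry {

    private final Map<String, Supplier<Object>> factories = new HashMap<>();

    void register(String elementName, Supplier<Object> factory) {
        factories.put(elementName, factory);
    }

    Object create(String elementName) {
        Supplier<Object> factory = factories.get(elementName);
        if (factory == null) {
            // Fail hard instead of returning null and silently using defaults.
            throw new IllegalArgumentException(
                "Unknown feature generator element: " + elementName);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        StrictGeneratorRegistry registry = new StrictGeneratorRegistry();
        registry.register("tokenclass", () -> "token class generator placeholder");

        System.out.println(registry.create("tokenclass"));
        try {
            registry.create("brownclustr"); // deliberately misspelled element name
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With this, a typo in the descriptor shows up immediately with the offending element name, and the user can decide whether to fix it or to drop the descriptor and train with the defaults.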

Steps 4 and 5 as you describe them should not be necessary to add new feature generators.

The idea is that we always use the XML descriptor to define the feature generation. That way we can have different configurations without changing the OpenNLP code itself, and we don't need special user code to integrate a customized name finder model. If a model makes use of external classes, these of course need to be on the classpath,
since we can't ship them as part of the model.

HTH,
Jörn
