Hi again,

All these problems are addressed in this issue:

https://issues.apache.org/jira/browse/OPENNLP-717

Only one issue remains: the -featuregen parameter only takes effect if
the -factory parameter is also given, and training silently backs off
to the default features when -factory is not used.

Thanks,

Rodrigo

On Sat, Oct 4, 2014 at 12:53 AM, Rodrigo Agerri <[email protected]> wrote:
> Hi,
>
> As a follow-up, it turns out that currently we can provide a feature
> generator via the -featuregen parameter only if we also provide a
> subclass via the -factory parameter. I do not know if that is
> intended. I have also noticed some very weird behaviour: I pass
> several descriptors via the CLI (starting with token features only,
> then adding tokenclass, etc.) and all goes well until I add either the
> Prefix or the SuffixFeatureGenerator, at which point performance drops
> alarmingly, to 49.65 F1 when prefix and suffix are added to the
> default descriptor:
>
> bin/opennlp TokenNameFinderTrainer -featuregen bigram.xml -factory
> opennlp.tools.namefind.TokenNameFinderFactory -sequenceCodec BIO
> -params lang/ml/PerceptronTrainerParams.txt -lang nl -model test.bin
> -data ~/experiments/nerc/opennlp/data/nl/conll2002/nl_opennlp.testa.train
>
> I get this behaviour with all four CoNLL 2002 and CoNLL 2003 datasets.
>
> <generators>
>   <cache>
>     <generators>
>       <window prevLength="2" nextLength="2">
>         <tokenclass/>
>       </window>
>       <window prevLength="2" nextLength="2">
>         <token/>
>       </window>
>       <definition/>
>       <prevmap/>
>       <bigram/>
>       <sentence begin="true" end="false"/>
>       <prefix/>
>       <suffix/>
>     </generators>
>   </cache>
> </generators>
>
> Cheers,
>
> Rodrigo
>
> On Fri, Oct 3, 2014 at 5:55 PM, Rodrigo Agerri <[email protected]> wrote:
>> Hi Jörn,
>>
>> On Fri, Oct 3, 2014 at 12:40 PM, Jörn Kottmann <[email protected]> wrote:
>>>
>>> There are two things you need to do:
>>> 1. Implement the feature generators
>>> - Implement AdaptiveFeatureGenerator or extend CustomFeatureGenerator if you
>>> need to pass parameters to it
>>
>> OK, let's say I have this descriptor:
>>
>> <generators>
>>   <cache>
>>     <generators>
>>       <window prevLength="2" nextLength="2">
>>         <token />
>>       </window>
>>       <custom class="es.ehu.si.ixa.pipe.nerc.features.Prefix34FeatureGenerator"/>
>>     </generators>
>>   </cache>
>> </generators>
>>
>> Now, if I understand the implementation (and your comments) correctly:
>>
>> 1. I should just create a Prefix34FeatureGenerator class extending
>> FeatureGeneratorAdapter.
>> 2. If I wanted to pass parameters, e.g., descriptor attributes, then I
>> should extend CustomFeatureGenerator.
>> 3. If I load such a descriptor as the argument of -featuregen in the
>> CLI, the CLI should complain if the class is not on the classpath, I
>> guess. If it is on the classpath, then it should use the custom
>> generator.
>> 4. As it is now, no matter what value you pass to -featuregen, it
>> always trains the default features. It does not complain even if the
>> custom feature generator is not well-formed. Even if I pass only the
>> token features, it still loads the default generator. With version
>> 1.5.3 it works fine, though. I am looking into it, but any hints are
>> welcome :)
>> 5. When I do this programmatically, e.g., load the feature generator
>> descriptor into an extension of TokenNameFinderFactory, it seems to
>> load the custom generators: the GeneratorFactory loads the descriptor
>> I pass, e.g., if it contains only tokens then training successfully
>> uses only tokens. However, if I pass a custom generator, it does not
>> complain; it trains, and the performance drops to 40 F1. For the
>> record, I build the descriptor programmatically like this
>>
>>   Element prefixFeature = new Element("custom");
>>   prefixFeature.setAttribute("class", Prefix34FeatureGenerator.class.getName());
>>   generators.addContent(prefixFeature);
>>
>> and then the GeneratorFactory does get it without errors.
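For reference, here is the core logic such a Prefix34FeatureGenerator might contain, sketched as a plain static method. The 3- and 4-character prefix lengths and the "pre=" feature name are my assumptions based on the class name, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixFeatureSketch {

    // Emit prefix features of length 3 and 4 for a single token, the kind
    // of strings a feature generator would add for tokens[index].
    public static List<String> prefixFeatures(String token) {
        List<String> features = new ArrayList<>();
        for (int len = 3; len <= 4; len++) {
            if (token.length() >= len) {
                features.add("pre=" + token.substring(0, len));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(prefixFeatures("Amsterdam")); // [pre=Ams, pre=Amst]
    }
}
```

In a real generator, createFeatures() would append strings like these to the feature list it receives for the current token.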
>>
>>> 2. Implement support for loading and serializing the data they need
>>> - This class should implement SerializableArtifact
>>> - And if you want to load and use it, the feature generator should
>>> implement ArtifactToSerializerMapper, which tells the loader which
>>> class to use to read the data file
>>
>> This is only for the clustering feature resources and such, I guess.
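For example, a word-to-cluster resource would need a load/serialize round trip roughly like this. I sketch it over plain strings, leaving out the SerializableArtifact/ArtifactSerializer wiring, and the tab-separated format is just an assumption for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ClusterLexiconSketch {

    // Write a word -> cluster map as "word<TAB>cluster" lines; this is the
    // kind of payload a SerializableArtifact implementation would write out.
    public static String serialize(Map<String, String> clusters) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : clusters.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // The matching loader the mapped serializer would run on the model's
    // data file when the model is opened again.
    public static Map<String, String> load(String data) {
        Map<String, String> clusters = new LinkedHashMap<>();
        for (String line : data.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t", 2);
            clusters.put(parts[0], parts[1]);
        }
        return clusters;
    }
}
```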
>>
>>> The above is the procedure you should use if you want to have a real custom
>>> feature generator which is not part of
>>> the OpenNLP Tools jar.
>>
>> Yes, what I do is include OpenNLP as a Maven dependency in an
>> uber-jar, i.e., with all classes inside, including OpenNLP and my
>> custom feature generators. The classpath should be fine in this case,
>> but I still cannot make them work.
>>
>>>
>>>> 6. *Some* of the new features work. If an element name in the
>>>> descriptor does not match anything in the GeneratorFactory, then
>>>> TokenNameFinderFactory.createFeatureGenerators() returns null and
>>>> TokenNameFinderFactory.createContextGenerator() silently stops the
>>>> feature creation and falls back to
>>>> NameFinderME.createFeatureGenerator().
>>>> Is this the desired behaviour? Perhaps we could add a log message
>>>> somewhere to report the backoff to the default features when a
>>>> descriptor element does not match?
>>>
>>>
>>> That sounds really bad. If there is a problem in the mapping it
>>> should fail hard and throw an exception. The user should be forced
>>> to decide for himself what to do: either fix his descriptor or use
>>> the defaults.
>>
>> I can open an issue and look into it.
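For the record, the fail-hard behaviour could look roughly like this; the element-to-class table below is an illustrative stand-in, not GeneratorFactory's actual mapping:

```java
import java.util.Map;

public class FailHardLookupSketch {

    // Illustrative stand-in for the descriptor-element-to-generator table.
    private static final Map<String, String> GENERATORS = Map.of(
            "token", "TokenFeatureGenerator",
            "tokenclass", "TokenClassFeatureGenerator",
            "prefix", "PrefixFeatureGenerator");

    // Resolve a descriptor element name, throwing instead of returning
    // null, so a typo in the XML fails hard rather than silently falling
    // back to the default features.
    public static String resolve(String elementName) {
        String generator = GENERATORS.get(elementName);
        if (generator == null) {
            throw new IllegalArgumentException(
                    "Unknown feature generator element: " + elementName);
        }
        return generator;
    }
}
```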
>>
>>> The idea is that we always use the XML descriptor to define the
>>> feature generation; that way we can have different configurations
>>> without changing the OpenNLP code itself, and we don't need special
>>> user code to integrate a customized name finder model. If a model
>>> makes use of external classes, these of course need to be on the
>>> classpath, since we can't ship them as part of the model.
>>
>> OK, but I think what I did above is what you meant, is it not?
>>
>> Thanks,
>>
>> Rodrigo
