Hi all,
as proposed earlier, I think we should go ahead and define and implement
the training parameters format and classes. We need to define the format
and decide how to change our current training implementation.
I believe it should be part of OpenNLP Tools and not the maxent package,
for two reasons. First, it should be possible to define parameters for
different models, whereas maxent only deals with one model at a time.
Second, the new API does not depend on maxent (which will be replaced
with opennlp-ml).
The parser contains multiple models. Someone may want to train one of
them with perceptron and another with maxent, or experiment with cutoff
and iterations for a particular model.
I propose that we simply use a Java properties file.
For the name finder it could look like this:
Algorithm=MAXENT
Iterations=150
Cutoff=4
Or for the parser:
build.Algorithm=MAXENT
build.Iterations=180
build.Threads=4
check.Algorithm=MAXENT
check.Iterations=120
check.Threads=2
tagger.Algorithm=PERCEPTRON
tagger.Iterations=130
tagger.Cutoff=0
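A minimal sketch of how such a file could be read, assuming plain
java.util.Properties is enough for the job. The class name
LoadTrainingParams and the parse method are made up for illustration and
are not part of any existing OpenNLP API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not an existing OpenNLP class.
class LoadTrainingParams {

    // Parses training parameters written in Java properties syntax.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "build.Algorithm=MAXENT",
                "build.Iterations=180",
                "tagger.Algorithm=PERCEPTRON",
                "tagger.Cutoff=0");
        Properties props = parse(text);
        // All values are stored as strings; numeric ones are converted on use.
        System.out.println(props.getProperty("build.Algorithm"));
        System.out.println(Integer.parseInt(props.getProperty("build.Iterations")));
    }
}
```

For a real file, the same Properties.load call would take a Reader or
InputStream opened on that file.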
The maxent package will provide a small utility which can validate the
parameters for a given algorithm and then run the training according to
those parameters.
That could look like this:
isValid(Map<String, String> params);
train(Map<String, String> params, EventStream events)
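To make the validation part more concrete, here is a rough sketch of what
isValid might check. The rules are assumptions for the sake of the
example (Algorithm is required and must be MAXENT or PERCEPTRON,
Iterations must be at least 1 and Cutoff at least 0 when present); the
actual rules would have to be decided per algorithm. The class name
ParamValidator is hypothetical, and train is left out because it depends
on the event stream and model classes.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical validator; the rules below are illustrative assumptions only.
class ParamValidator {

    private static final Set<String> ALGORITHMS = Set.of("MAXENT", "PERCEPTRON");

    // Returns true if the parameters satisfy the assumed rules.
    static boolean isValid(Map<String, String> params) {
        String algorithm = params.get("Algorithm");
        if (algorithm == null || !ALGORITHMS.contains(algorithm)) {
            return false;
        }
        return isOptionalIntAtLeast(params.get("Iterations"), 1)
                && isOptionalIntAtLeast(params.get("Cutoff"), 0);
    }

    // A missing value is accepted; a present value must be an integer >= min.
    private static boolean isOptionalIntAtLeast(String value, int min) {
        if (value == null) {
            return true;
        }
        try {
            return Integer.parseInt(value.trim()) >= min;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(Map.of("Algorithm", "MAXENT", "Iterations", "150", "Cutoff", "4")));
        System.out.println(isValid(Map.of("Algorithm", "MAXENT", "Iterations", "many")));
    }
}
```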
Depending on which model should be trained, the training parameters can
be reduced to the relevant subset by providing a namespace.
To train the build model in the sample above, the following call would
be made:
TrainingParameters.getParams("build");
which returns a Map<String, String> with this content:
Algorithm=MAXENT
Iterations=180
Threads=4
This map is then passed to the train method, which trains the model on
the provided event stream.
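The namespace reduction described above could be sketched roughly as
follows. This assumes the full parameter set is already available as a
Map<String, String> and that a namespace is simply the key prefix before
the first dot; the class name NamespaceParams is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper showing prefix-based parameter selection.
class NamespaceParams {

    // Returns the entries whose key starts with "<namespace>.",
    // with that prefix stripped from the key.
    static Map<String, String> getParams(Map<String, String> all, String namespace) {
        String prefix = namespace + ".";
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : all.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                result.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> all = new LinkedHashMap<>();
        all.put("build.Algorithm", "MAXENT");
        all.put("build.Iterations", "180");
        all.put("build.Threads", "4");
        all.put("tagger.Algorithm", "PERCEPTRON");
        // Only the three build.* entries remain, keyed without the prefix.
        System.out.println(getParams(all, "build"));
    }
}
```

Stripping the prefix means the training code sees the same keys
(Algorithm, Iterations, ...) whether the file uses namespaces or not.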
Any opinions?
I am +1 on making this change for 1.5.2, but we need to maintain strict
backward compatibility.
Jörn