Re: Releasing a Language Detection Model

Jeff Zemerick Mon, 10 Jul 2017 14:42:07 -0700

+1 to an opennlp-models jar on Maven Central that contains the models.
+1 to having the models available for download separately (if easily
possible) for users who know what they want.
+1 to having the training data shared somewhere with scripts to generate
the models. It will help protect against losing data as William mentioned.
I don't think we should depend on others to reliably host the data. I'll
volunteer to help script the model generation to run on a fleet of EC2
instances if it helps.


If the user does not provide a model to use on the CLI, can the CLI tools
look on the classpath for a model whose name fits the needed model (like
en-ner-person.bin) and if found use it automatically?
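
A minimal sketch of what such a lookup could look like, assuming models are packaged at the root of the classpath under names like en-ner-person.bin. ClasspathModelLocator, resourceName and open are hypothetical names for illustration only, not existing OpenNLP API; the sketch uses only the JDK's ClassLoader.getResourceAsStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;

/** Hypothetical helper (not part of OpenNLP): locates a model on the classpath by name. */
final class ClasspathModelLocator {

    private ClasspathModelLocator() {}

    /**
     * Builds a conventional model file name such as "en-ner-person.bin".
     * The subtype may be null or empty, giving e.g. "en-token.bin".
     * The naming scheme itself is an assumption based on the example above.
     */
    static String resourceName(String language, String component, String subtype) {
        StringBuilder name = new StringBuilder(language).append('-').append(component);
        if (subtype != null && !subtype.isEmpty()) {
            name.append('-').append(subtype);
        }
        return name.append(".bin").toString();
    }

    /** Returns a stream for the named classpath resource, or empty if it is not present. */
    static Optional<InputStream> open(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathModelLocator.class.getClassLoader();
        }
        return Optional.ofNullable(loader.getResourceAsStream(resourceName));
    }

    public static void main(String[] args) throws IOException {
        String name = resourceName("en", "ner", "person");
        Optional<InputStream> model = open(name);
        if (model.isPresent()) {
            try (InputStream in = model.get()) {
                // A real CLI tool would hand `in` to the matching model loader here.
                System.out.println("Found " + name + " on the classpath");
            }
        } else {
            System.out.println("No " + name + " on the classpath; a model file must be given explicitly");
        }
    }
}
```

With this approach a model passed explicitly on the command line would still take precedence, and the classpath lookup would only be a fallback.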

Jeff



On Mon, Jul 10, 2017 at 5:06 PM, Chris Mattmann <[email protected]> wrote:

> +1. In terms of releasing models, maybe an opennlp-models package, and then
> using Maven structure of src/main/resources/<package prefix dirs>/*.bin for
> putting the models.
>
> Then using an assembly descriptor to compile the above into a *-bin.jar?
>
> Cheers,
> Chris
>
>
>
>
> On 7/10/17, 4:09 PM, "Joern Kottmann" <[email protected]> wrote:
>
>     My opinion about this is that we should offer the model as maven
>     dependency for users who just want to use it in their projects, and
>     also offer models for download for people to quickly try out OpenNLP.
>     If the models can be downloaded, new users could very quickly test
>     it via the command line.
>
>     I don't really have any thoughts yet on how we should organize it, it
>     would probably be nice to have some place where we can share all the
>     training data, and then have the scripts to produce the models checked
>     in. It should be easy to retrain all the models in case we do a major
>     release.
>
>     If a corpus vanishes we should drop support for it; it must be
>     obsolete by then.
>
>     Jörn
>
>     On Mon, Jul 10, 2017 at 8:50 PM, William Colen <[email protected]> wrote:
>     > We need to address things such as sharing the evaluation results and
>     > how to reproduce the training.
>     >
>     > There are several possibilities for that, but there are points to
>     > consider:
>     >
>     > Will we store the model itself in an SCM repository or only the code
>     > that can build it?
>     > Will we deploy the models to a Maven Central repository? That is good
>     > for people using the Java API but not for the command line interface.
>     > Should we change the CLI to handle models in the classpath?
>     > Should we keep a copy of the training model or always download from
>     > the original provider? We can't guarantee that the corpus will be
>     > there forever, not only because of a license change, but simply
>     > because the provider no longer keeps the server up.
>     >
>     > William
>     >
>     >
>     >
>     > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <[email protected]>:
>     >
>     >> Hello all,
>     >>
>     >> since Apache OpenNLP 1.8.1 we have a new language detection
>     >> component which, like all our components, has to be trained. I think
>     >> we should release a pre-built model for it trained on the Leipzig
>     >> corpus. This will allow the majority of our users to get started
>     >> very quickly with language detection without the need to figure out
>     >> how to train it.
>     >>
>     >> How should this project release models?
>     >>
>     >> Jörn
>     >>
>
>
>
>
