+1 to an opennlp-models jar on Maven Central that contains the models. +1 to having the models available for download separately (if easily possible) for users who know what they want. +1 to having the training data shared somewhere with scripts to generate the models. It will help protect against losing data as William mentioned. I don't think we should depend on others to reliably host the data. I'll volunteer to help script the model generation to run on a fleet of EC2 instances if it helps.
If the user does not provide a model on the CLI, can the CLI tools look on the classpath for a model whose name fits the needed model (like en-ner-person.bin) and, if found, use it automatically?

Jeff

On Mon, Jul 10, 2017 at 5:06 PM, Chris Mattmann <mattm...@apache.org> wrote:

> +1. In terms of releasing models, maybe an opennlp-models package, and then
> using Maven structure of src/main/resources/<package prefix dirs>/*.bin for
> putting the models.
>
> Then using an assembly descriptor to compile the above into a *-bin.jar?
>
> Cheers,
> Chris
>
> On 7/10/17, 4:09 PM, "Joern Kottmann" <kottm...@gmail.com> wrote:
>
>     My opinion about this is that we should offer the model as maven
>     dependency for users who just want to use it in their projects, and
>     also offer models for download for people to quickly try out OpenNLP.
>     If the models can be downloaded, new users could very quickly test
>     it via the command line.
>
>     I don't really have any thoughts yet on how we should organize it, it
>     would probably be nice to have some place where we can share all the
>     training data, and then have the scripts to produce the models checked
>     in. It should be easy to retrain all the models in case we do a major
>     release.
>
>     In case a corpus is vanishing we should drop support for it, must be
>     obsolete then.
>
>     Jörn
>
>     On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org> wrote:
>     > We need to address things such as sharing the evaluation results and
>     > how to reproduce the training.
>     >
>     > There are several possibilities for that, but there are points to
>     > consider:
>     >
>     > Will we store the model itself in a SCM repository or only the code
>     > that can build it?
>     > Will we deploy the models to a Maven Central repository? It is good
>     > for people using the Java API but not for command line interface,
>     > should we change the CLI to handle models in the classpath?
>     > Should we keep a copy of the training model or always download from
>     > the original provider? We can't guarantee that the corpus will be
>     > there forever, not only because it changed license, but simply
>     > because the provider is not keeping the server up anymore.
>     >
>     > William
>     >
>     > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:
>     >
>     >> Hello all,
>     >>
>     >> since Apache OpenNLP 1.8.1 we have a new language detection component
>     >> which like all our components has to be trained. I think we should
>     >> release a pre-built model for it trained on the Leipzig corpus. This
>     >> will allow the majority of our users to get started very quickly with
>     >> language detection without the need to figure out how to train it.
>     >>
>     >> How should this project release models?
>     >>
>     >> Jörn
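
To make the classpath question at the top of this mail a bit more concrete, here is a rough sketch of the fallback lookup. It is only an illustration, not existing OpenNLP code: the class name ClasspathModelLocator, both method names, and the <language>-<component>-<type>.bin pattern (generalized from the en-ner-person.bin example) are assumptions, and it assumes the model sits at the root of the classpath rather than under package prefix directories.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Optional;

// Hypothetical helper, not part of OpenNLP: finds a model on the classpath
// when the user did not pass a model file on the command line.
public class ClasspathModelLocator {

    // Builds a file name following the pattern of the example in this mail,
    // e.g. ("en", "ner", "person") gives "en-ner-person.bin".
    // The pattern itself is an assumption; the type part is optional.
    public static String modelFileName(String language, String component, String type) {
        StringBuilder name = new StringBuilder(language).append('-').append(component);
        if (type != null && !type.isEmpty()) {
            name.append('-').append(type);
        }
        return name.append(".bin").toString();
    }

    // Looks the file up at the classpath root; empty if no jar or directory provides it.
    public static Optional<URL> findOnClasspath(String fileName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathModelLocator.class.getClassLoader();
        }
        return Optional.ofNullable(loader.getResource(fileName));
    }

    public static void main(String[] args) throws IOException {
        String fileName = modelFileName("en", "ner", "person");
        Optional<URL> model = findOnClasspath(fileName);
        if (model.isPresent()) {
            try (InputStream in = model.get().openStream()) {
                // A real CLI tool would pass this stream to the matching model class here.
                System.out.println("Using classpath model " + model.get());
            }
        } else {
            System.err.println("No model given and " + fileName + " not found on the classpath");
        }
    }
}
```

With an opennlp-models jar on the classpath the lookup would pick up the bundled file. Without it the tool can still fail with a clear message, so an explicitly passed model keeps working as before.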