+1
2017-07-11 10:30 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:

> Hello,
>
> right, very good point, I also think that it is very important to load
> a model in one line from the classpath.
>
> I propose we have the following setup:
> - One jar contains one or multiple model packages (that's the zip container)
> - A model name itself should be kind of unique, e.g. eng-ud-token.bin
> - A user loads the model via: new
>   SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream
>   then gets closed properly
>
> Let's take away three things from this discussion:
> 1) Store the data in a place where the community can access it
> 2) Offer models on our download page, similar to what is done today on
>    the SourceForge page
> 3) Release models packed inside a jar file via Maven Central
>
> Jörn
>
> On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu
> <aliaksa...@autayeu.com> wrote:
> > To clarify on models and jars.
> >
> > Putting a model inside a jar might not be a good idea. I mean here things
> > like bla-bla.jar/en-sent.bin. Our models are already zipped, so they are
> > "jars" already in a sense. We're good. However, current packaging and
> > metadata might not be very classpath friendly.
> >
> > The use case I have in mind is being able to add needed models as
> > dependencies and load them by writing a line of code. For this case, having
> > all models in a root with the same name might not be very convenient. The
> > same goes for the manifest. The name "manifest.properties" is quite generic
> > and it's not too far-fetched to see some clashes because some other lib also
> > manifests something. It might be better to allow for some flexibility and
> > to adhere to classpath conventions. For example, having manifests in
> > something like org/apache/opennlp/models/manifest.properties. Or
> > opennlp/tools/manifest.properties. And perhaps even allowing to reference a
> > model in the manifest, so the model can be put elsewhere.
> > Just in case there are several custom models of the same kind for different
> > pipelines in the same app. For example, processing queries with one
> > pipeline - one set of models - and processing documents with another
> > pipeline - another set of models. In this case allowing for different
> > classpaths is needed.
> >
> > Perhaps to illustrate my thinking, something like this (which still keeps a
> > lot of possibilities open):
> > en-sent.bin/opennlp/tools/sentdetect/manifest.properties (perhaps contains
> > a line with something like model =
> > /opennlp/tools/sentdetect/model/sent.model)
> > en-sent.bin/opennlp/tools/sentdetect/model/sent.model
> >
> > This allows including en-sent.bin as a dependency. And then doing something
> > like
> > SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we want
> > default models in this way. Seems verbose enough to allow for some safety
> > through explicitness. That's if we want any defaults at all.
> > Or something like:
> > SentenceModel sdm =
> > SentenceModel.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");
> > Or
> > SentenceModel sdm =
> > SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");
> > Or more in line with the current style:
> > SentenceModel sdm = new
> > SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though here
> > we commit to interpreting String as classpath reference. That's why I'd
> > prefer more explicit method names.
> > Or leave dealing with resources to the users, leave current code intact and
> > provide only packaging and distribution:
> > SentenceModel sdm = new
> > SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or
> > model"));
> >
> > And to add to model metadata also F1/accuracy (at least CV-based, for
> > example 10-fold) for quick reference or quick understanding of what that
> > model is capable of. Could be helpful for those with a bunch of models
> > around.
> > And for others as well, to have better insight into the model in
> > question.
> >
> > On 11 July 2017 at 06:37, Chris Mattmann <mattm...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> FWIW, I’ve seen CLI tools – lots in my day – that can load from the CLI to
> >> override an internal classpath dependency. This is for people in
> >> environments who want a sensible / delivered internal classpath default
> >> and the ability for run-time, non zipped up/messing with JAR file
> >> override. Think about people who are using OpenNLP in both Java/Python
> >> environments as an example.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On 7/11/17, 3:25 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:
> >>
> >>     I would not change the CLI to load models from jar files. I never used
> >>     or saw a command line tool that expects a file as an input and would
> >>     then also load it from inside a jar file. It will be hard to
> >>     communicate how that works precisely in the CLI usage texts and this
> >>     is not a feature anyone would expect to be there. The intention of the
> >>     CLI is to give users the ability to quickly test OpenNLP before they
> >>     integrate it into their software and to train and evaluate models.
> >>
> >>     Users who for some reason have a jar file with a model inside can just
> >>     write "unzip model.jar".
> >>
> >>     After all I think this is quite a bit of complexity we would need to
> >>     add for it and it will have very limited use.
> >>
> >>     The use case of publishing jar files is to make the models easily
> >>     available to people who have a build system with dependency
> >>     management: they won't have to download models manually, and when they
> >>     update OpenNLP they can also update the models with a version string
> >>     change.
> >>
> >>     For the command line "quick start" use case we should offer the models
> >>     on a download page as we do today.
> >>     This page could list both the download link and the Maven dependency.
> >>
> >>     Jörn
> >>
> >>     On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org>
> >>     wrote:
> >>     > We need to address things such as sharing the evaluation results and
> >>     > how to reproduce the training.
> >>     >
> >>     > There are several possibilities for that, but there are points to
> >>     > consider:
> >>     >
> >>     > Will we store the model itself in a SCM repository or only the code
> >>     > that can build it?
> >>     > Will we deploy the models to a Maven Central repository? It is good
> >>     > for people using the Java API but not for the command line interface;
> >>     > should we change the CLI to handle models in the classpath?
> >>     > Should we keep a copy of the training model or always download from
> >>     > the original provider? We can't guarantee that the corpus will be
> >>     > there forever, not only because it changed license, but simply
> >>     > because the provider is not keeping the server up anymore.
> >>     >
> >>     > William
> >>     >
> >>     > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:
> >>     >
> >>     >> Hello all,
> >>     >>
> >>     >> since Apache OpenNLP 1.8.1 we have a new language detection
> >>     >> component which, like all our components, has to be trained. I think
> >>     >> we should release a pre-built model for it trained on the Leipzig
> >>     >> corpus. This will allow the majority of our users to get started
> >>     >> very quickly with language detection without the need to figure out
> >>     >> how to train it.
> >>     >>
> >>     >> How should this project release models?
> >>     >>
> >>     >> Jörn
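
The loading options discussed in this thread (a manifest that points at the model, a classpath default, and a run-time file override) can be combined in a few lines of plain Java. The following is only a rough sketch under stated assumptions: `ModelResources`, `modelPathFromManifest` and `open` are hypothetical names that do not exist in OpenNLP, and the manifest is assumed to be a properties file with a `model` key, as in the example quoted above. It uses only the JDK, so it compiles without OpenNLP on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical helper for illustration only; not part of the OpenNLP API. */
final class ModelResources {

    private ModelResources() {}

    /**
     * Reads the classpath location of a model from a manifest. Assumes the
     * manifest is a properties file with a "model" key (an assumption taken
     * from the example in the thread, not an existing OpenNLP convention).
     */
    static String modelPathFromManifest(InputStream manifest) throws IOException {
        Properties props = new Properties();
        props.load(manifest);
        String path = props.getProperty("model");
        if (path == null || path.trim().isEmpty()) {
            throw new IOException("manifest has no 'model' entry");
        }
        // Properties keeps trailing whitespace in values, so trim it here.
        return path.trim();
    }

    /**
     * Opens a model stream. If overrideFile is an existing regular file it
     * wins (the run-time override case); otherwise the classpath resource is
     * used. The caller is responsible for closing the returned stream.
     */
    static InputStream open(Path overrideFile, String classpathResource)
            throws IOException {
        if (overrideFile != null && Files.isRegularFile(overrideFile)) {
            return Files.newInputStream(overrideFile);
        }
        InputStream in = ModelResources.class.getResourceAsStream(classpathResource);
        if (in == null) {
            throw new FileNotFoundException("not on classpath: " + classpathResource);
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        // In-memory manifest standing in for a manifest.properties resource.
        byte[] manifest = "model = /opennlp/tools/sentdetect/model/sent.model\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        String path = modelPathFromManifest(new ByteArrayInputStream(manifest));
        System.out.println(path);
    }
}
```

With OpenNLP and a model jar on the classpath, the stream returned by `open` could be passed to the existing stream-based constructor inside a try-with-resources block, along the lines of `new SentenceModel(in)` in the last variant quoted above, so that the stream is closed once the model has been read.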