To clarify on models and jars.

Putting a model inside a jar might not be a good idea. I mean here things
like bla-bla.jar/en-sent.bin. Our models are already zipped, so in a sense
they are already "jars". We're good there. However, the current packaging
and metadata might not be very classpath friendly.

The use case I have in mind is being able to add the needed models as
dependencies and load them with a single line of code. For this case,
having all models in the root with the same name might not be very
convenient. The same goes for the manifest. The name "manifest.properties"
is quite generic, and it's not too far-fetched to expect clashes because
some other library also manifests something. It might be better to allow
for some flexibility and to adhere to classpath conventions, for example
by placing manifests in something like
org/apache/opennlp/models/manifest.properties or
opennlp/tools/manifest.properties. And perhaps even allowing the manifest
to reference a model, so the model itself can be put elsewhere. That
matters when there are several custom models of the same kind for
different pipelines in the same app: for example, processing queries with
one pipeline (one set of models) and processing documents with another
pipeline (another set of models). In that case different classpath
locations are needed.

Perhaps to illustrate my thinking, something like this (which still keeps
a lot of possibilities open):

en-sent.bin/opennlp/tools/sentdetect/manifest.properties
    (perhaps containing a line like
     model = /opennlp/tools/sentdetect/model/sent.model)
en-sent.bin/opennlp/tools/sentdetect/model/sent.model
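Spelled out as a file, such a manifest might look like the sketch below.
All key names here are hypothetical, my invention rather than an existing
OpenNLP convention; the point is only that "model" references the model's
classpath location:

```properties
# /opennlp/tools/sentdetect/manifest.properties inside en-sent.bin
# (hypothetical layout and key names, for illustration only)
model = /opennlp/tools/sentdetect/model/sent.model
language = en
component = sentdetect
```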

This allows including en-sent.bin as a dependency and then doing something
like:

SentenceModel sdm = SentenceModel.getDefaultResourceModel();

if we want default models exposed this way. It seems verbose enough to
allow for some safety through explicitness. That's if we want any defaults
at all. Or something like:

SentenceModel sdm =
    SentenceModel.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");

Or:

SentenceModel sdm =
    SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");

Or, more in line with the current style:

SentenceModel sdm =
    new SentenceModel("/opennlp/tools/sentdetect/model/sent.model");

though here we commit to interpreting a String as a classpath reference,
which is why I'd prefer more explicit method names. Or we leave dealing
with resources to the users, keep the current code intact, and provide
only packaging and distribution:

SentenceModel sdm =
    new SentenceModel(this.getClass().getResourceAsStream(
        "/.../.../manifest or model"));
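If we go the classpath route, the wiring is simple enough that users could
even do it themselves today. Here is a minimal sketch assuming the layout
above; the ModelLoader class and the "model" property name are my
inventions, not existing OpenNLP API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Sketch of classpath-based model loading under the hypothetical layout
// proposed above. Not existing OpenNLP API.
public class ModelLoader {

    // Extracts the model's classpath location from a loaded manifest.
    // The "model" property name is a hypothetical convention.
    public static String resolveModelPath(Properties manifest) {
        String path = manifest.getProperty("model");
        if (path == null) {
            throw new IllegalArgumentException("Manifest has no 'model' entry");
        }
        return path;
    }

    // Reads a manifest from the classpath and opens a stream for the
    // model file it references.
    public static InputStream openModel(String manifestPath) throws IOException {
        try (InputStream manifestIn =
                 ModelLoader.class.getResourceAsStream(manifestPath)) {
            if (manifestIn == null) {
                throw new IOException("Manifest not on classpath: " + manifestPath);
            }
            Properties manifest = new Properties();
            manifest.load(manifestIn);
            InputStream modelIn =
                ModelLoader.class.getResourceAsStream(resolveModelPath(manifest));
            if (modelIn == null) {
                throw new IOException("Model not on classpath for: " + manifestPath);
            }
            return modelIn;
        }
    }
}
```

A model could then be built with the existing InputStream constructor,
something like new SentenceModel(
ModelLoader.openModel("/opennlp/tools/sentdetect/manifest.properties")).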


And it would be good to also add F1/accuracy to the model metadata (at
least cross-validated, for example 10-fold) for quick reference, or a
quick understanding of what the model is capable of. That could be helpful
for those with a bunch of models around, and it gives everyone else better
insight into the model in question.
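To sketch what that could look like in the same manifest, a few extra
entries would do. Key names are again hypothetical, and the values below
are placeholders, not real measurements:

```properties
# Hypothetical evaluation metadata in manifest.properties
evaluation.method = 10-fold cross-validation
evaluation.f1 = <F1 score measured at release time>
evaluation.corpus = <corpus name and version>
```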



On 11 July 2017 at 06:37, Chris Mattmann <mattm...@apache.org> wrote:

> Hi,
>
> FWIW, I’ve seen CLI tools – lots in my day – that can load a model from
> the CLI to override an internal classpath dependency. This is for people
> in environments who want a sensible / delivered internal classpath
> default and the ability to override it at run time without zipping up or
> messing with the JAR file. Think about people who are using OpenNLP in
> both Java/Python environments as an example.
>
> Cheers,
> Chris
>
>
>
>
> On 7/11/17, 3:25 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:
>
>     I would not change the CLI to load models from jar files. I have
>     never used or seen a command line tool that expects a file as input
>     and would then also load it from inside a jar file. It would be hard
>     to communicate precisely how that works in the CLI usage texts, and
>     this is not a feature anyone would expect to be there. The intention
>     of the CLI is to give users the ability to quickly test OpenNLP
>     before they integrate it into their software, and to train and
>     evaluate models.
>
>     Users who for some reason have a jar file with a model inside can just
>     write "unzip model.jar".
>
>     After all, I think this is quite a bit of complexity we would need
>     to add for it, and it would have very limited use.
>
>     The use case of publishing jar files is to make the models easily
>     available to people who have a build system with dependency
>     management: they won't have to download models manually, and when
>     they update OpenNLP they can also update the models with a version
>     string change.
>
>     For the command line "quick start" use case we should offer the
>     models on a download page as we do today. This page could list both
>     the download link and the Maven dependency.
>
>     Jörn
>
>     On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org> wrote:
>     > We need to address things such as sharing the evaluation results
>     > and how to reproduce the training.
>     >
>     > There are several possibilities for that, but there are points to
>     > consider:
>     >
>     > Will we store the model itself in an SCM repository or only the
>     > code that can build it?
>     > Will we deploy the models to a Maven Central repository? It is
>     > good for people using the Java API but not for the command line
>     > interface; should we change the CLI to handle models in the
>     > classpath?
>     > Should we keep a copy of the training corpus or always download it
>     > from the original provider? We can't guarantee that the corpus
>     > will be there forever, not only because it changed license, but
>     > simply because the provider is not keeping the server up anymore.
>     >
>     > William
>     >
>     >
>     >
>     > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:
>     >
>     >> Hello all,
>     >>
>     >> since Apache OpenNLP 1.8.1 we have a new language detection
>     >> component which, like all our components, has to be trained. I
>     >> think we should release a pre-built model for it trained on the
>     >> Leipzig corpus. This will allow the majority of our users to get
>     >> started very quickly with language detection without the need to
>     >> figure out how to train it.
>     >>
>     >> How should this project release models?
>     >>
>     >> Jörn
>     >>
