1) This is already included today by default in the model. It is also possible to place more data in it, e.g. a file containing eval results, a LICENSE and NOTICE file, etc.
2) I would take a "best effort" approach and only publish one model per task and data set, unless there are really good reasons to publish multiple. In the case of langdetect the perceptron and maxent models perform almost identically, so there is no need to publish both. We should probably pick the perceptron model because it is slightly faster. And if a user disagrees with us - that is totally fine - they can always train a model themselves with their personal preferences. All the knowledge on how to train a model should be accessible via git, and then it is just a matter of running the right command to start it.

Jörn

On Tue, Jul 11, 2017 at 3:35 PM, Suneel Marthi <smar...@apache.org> wrote:
> ...one last point before wrapping up this discussion. Is it possible that you could have more than one lang detect model, each trained with a different algorithm - like say 'MaxEnt', 'Naive Bayes', 'Perceptron'?
>
> Questions:
>
> 1. Do we just publish one model trained on a specific algorithm? If so, the metadata would have the algorithm information.
>
> 2. Do we publish multiple models for the same task, each trained on a different algorithm?
>
> On Tue, Jul 11, 2017 at 9:30 AM, Joern Kottmann <kottm...@gmail.com> wrote:
>
>> Hello,
>>
>> right, very good point, I also think that it is very important to be able to load a model in one line from the classpath.
>>
>> I propose we have the following setup:
>> - One jar contains one or multiple model packages (that's the zip container)
>> - A model name itself should be kind of unique, e.g. eng-ud-token.bin
>> - A user loads the model via: new SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream then gets closed properly
>>
>> Let's take away three things from this discussion:
>> 1) Store the data in a place where the community can access it
>> 2) Offer models on our download page similar to what is done today on the SourceForge page
>> 3) Release models packed inside a jar file via maven central
>>
>> Jörn
>>
>> On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu <aliaksa...@autayeu.com> wrote:
>> > To clarify on models and jars.
>> >
>> > Putting a model inside a jar might not be a good idea. I mean here things like bla-bla.jar/en-sent.bin. Our models are already zipped, so they are "jars" already in a sense. We're good. However, the current packaging and metadata might not be very classpath friendly.
>> >
>> > The use case I have in mind is being able to add the needed models as dependencies and load them by writing a line of code. For this case, having all models in the root with the same name might not be very convenient. The same goes for the manifest. The name "manifest.properties" is quite generic, and it's not too far-fetched to see some clashes because some other lib also manifests something. It might be better to allow for some flexibility and to adhere to classpath conventions. For example, having manifests in something like org/apache/opennlp/models/manifest.properties. Or opennlp/tools/manifest.properties. And perhaps even allowing the manifest to reference a model, so the model can be put elsewhere - just in case there are several custom models of the same kind for different pipelines in the same app. For example, processing queries with one pipeline - one set of models - and processing documents with another pipeline - another set of models. In this case, allowing for different classpath locations is needed.
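As a side note on the proposed setup above: loading a model in one line from the classpath already works with the current API, since the model classes have URL constructors. A minimal sketch, assuming a jar with eng-ud-sent.bin at its root is on the classpath:

import java.io.IOException;
import java.net.URL;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class ClasspathModelExample {

    public static void main(String[] args) throws IOException {
        // Resolve the model from the classpath. The URL constructor reads
        // the model package and closes the underlying stream itself.
        URL modelUrl = ClasspathModelExample.class.getResource("/eng-ud-sent.bin");
        SentenceModel model = new SentenceModel(modelUrl);

        SentenceDetectorME detector = new SentenceDetectorME(model);
        for (String sentence : detector.sentDetect("This is one. This is two.")) {
            System.out.println(sentence);
        }
    }
}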
>> > Perhaps to illustrate my thinking, something like this (which still keeps a lot of possibilities open):
>> > en-sent.bin/opennlp/tools/sentdetect/manifest.properties (perhaps contains a line with something like model = /opennlp/tools/sentdetect/model/sent.model)
>> > en-sent.bin/opennlp/tools/sentdetect/model/sent.model
>> >
>> > This allows including en-sent.bin as a dependency. And then doing something like:
>> > SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we want default models in this way. Seems verbose enough to allow for some safety through explicitness. That's if we want any defaults at all.
>> > Or something like:
>> > SentenceModel sdm = SentenceModel.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");
>> > Or:
>> > SentenceModel sdm = SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");
>> > Or more in line with the current style:
>> > SentenceModel sdm = new SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though here we commit to interpreting the String as a classpath reference. That's why I'd prefer more explicit method names.
>> > Or leave dealing with resources to the users, leave the current code intact and provide only packaging and distribution:
>> > SentenceModel sdm = new SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or model"));
>> >
>> > And to add to the model metadata also F1/accuracy (at least CV-based, for example 10-fold) for quick reference or a quick understanding of what that model is capable of. Could be helpful for those with a bunch of models around. And for others as well, to have better insight into the model in question.
>> >
>> > On 11 July 2017 at 06:37, Chris Mattmann <mattm...@apache.org> wrote:
>> >
>> >> Hi,
>> >>
>> >> FWIW, I've seen CLI tools - lots in my day - that can load from the CLI to override an internal classpath dependency. This is for people in environments who want a sensible, delivered internal classpath default and the ability to override it at run time without zipping up or messing with the JAR file. Think about people who are using OpenNLP in both Java/Python environments as an example.
>> >>
>> >> Cheers,
>> >> Chris
>> >>
>> >> On 7/11/17, 3:25 AM, "Joern Kottmann" <kottm...@gmail.com> wrote:
>> >>
>> >> I would not change the CLI to load models from jar files. I have never used or seen a command line tool that expects a file as input and would then also load it from inside a jar file. It would be hard to communicate precisely how that works in the CLI usage texts, and this is not a feature anyone would expect to be there. The intention of the CLI is to give users the ability to quickly test OpenNLP before they integrate it into their software, and to train and evaluate models.
>> >>
>> >> Users who for some reason have a jar file with a model inside can just write "unzip model.jar".
>> >>
>> >> After all, I think this is quite a bit of complexity we would need to add for it, and it would have very limited use.
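To make the getResourceModel idea from Aliaksandr's message concrete, here is a minimal sketch of what such a helper could look like. No such method exists in OpenNLP today; the helper class, the manifest location convention, and the "model" key are hypothetical:

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import opennlp.tools.sentdetect.SentenceModel;

public final class ResourceModels {

    private ResourceModels() {
    }

    // Hypothetical helper: resolves a manifest on the classpath, reads its
    // "model" entry (a classpath path to the actual model package), and
    // loads the model from that location.
    public static SentenceModel getResourceModel(String manifestPath) throws IOException {
        Properties manifest = new Properties();
        try (InputStream manifestIn = ResourceModels.class.getResourceAsStream(manifestPath)) {
            if (manifestIn == null) {
                throw new IOException("Manifest not found on classpath: " + manifestPath);
            }
            manifest.load(manifestIn);
        }

        String modelPath = manifest.getProperty("model");
        if (modelPath == null) {
            throw new IOException("Manifest has no 'model' entry: " + manifestPath);
        }

        try (InputStream modelIn = ResourceModels.class.getResourceAsStream(modelPath)) {
            if (modelIn == null) {
                throw new IOException("Model not found on classpath: " + modelPath);
            }
            return new SentenceModel(modelIn);
        }
    }
}

Usage would then be a single line, e.g. SentenceModel sdm = ResourceModels.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");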
>> >> The use case for publishing jar files is to make the models easily available to people who have a build system with dependency management: they won't have to download models manually, and when they update OpenNLP they can also update the models with a version string change.
>> >>
>> >> For the command line "quick start" use case we should offer the models on a download page as we do today. This page could list both the download link and the maven dependency.
>> >>
>> >> Jörn
>> >>
>> >> On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org> wrote:
>> >> > We need to address things such as sharing the evaluation results and how to reproduce the training.
>> >> >
>> >> > There are several possibilities for that, but there are points to consider:
>> >> >
>> >> > Will we store the model itself in an SCM repository, or only the code that can build it?
>> >> > Will we deploy the models to the Maven Central repository? That is good for people using the Java API but not for the command line interface; should we change the CLI to handle models in the classpath?
>> >> > Should we keep a copy of the training corpus or always download it from the original provider? We can't guarantee that the corpus will be there forever - not only because its license changed, but simply because the provider is not keeping the server up anymore.
>> >> >
>> >> > William
>> >> >
>> >> > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:
>> >> >
>> >> >> Hello all,
>> >> >>
>> >> >> since Apache OpenNLP 1.8.1 we have a new language detection component which, like all our components, has to be trained. I think we should release a pre-built model for it trained on the Leipzig corpus. This would allow the majority of our users to get started very quickly with language detection without the need to figure out how to train it.
>> >> >>
>> >> >> How should this project release models?
>> >> >>
>> >> >> Jörn
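If model jars do end up on Maven Central as discussed in this thread, pulling one in would look roughly like the snippet below; the coordinates are hypothetical and only illustrate the idea:

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <!-- hypothetical artifact and version, for illustration only -->
  <artifactId>opennlp-model-langdetect</artifactId>
  <version>1.8.1</version>
</dependency>

The model jar would then be on the classpath, and the model could be loaded via getResource as sketched earlier in the thread.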