I am also not for default models. We are a library, and people use it
inside other software products; that is the place where meaningful
defaults can be defined. Maybe our language model works very well: you
take it, hard-code it, and forget about it for the next couple of years.
Or it doesn't work, and you train your own set of models and swap them
depending on your input data source.

And then there are solutions out there that people can use to define
configuration for their software projects, such as Spring or Typesafe
Config. And probably something new one day. I am +1 to ensure that
OpenNLP is easy to use with the most common ones and to accept PRs that
increase ease of use.
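A rough sketch of what I mean, using plain java.util.Properties so the application (not the library) decides which model is loaded. The property key and paths here are made up for illustration, not an OpenNLP API:

```java
import java.util.Properties;

// Sketch: the application resolves the model path from its own
// configuration instead of the library shipping a default model.
// Key name "langdetect.model" and the paths are illustrative only.
public class ModelConfig {

    public static String resolveModelPath(Properties props) {
        // Fall back to a path the *application* chose to hard-code.
        return props.getProperty("langdetect.model", "models/langdetect.bin");
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("langdetect.model", "custom/my-langdetect.bin");

        // Application-supplied configuration wins over the app's fallback.
        System.out.println(resolveModelPath(props));
        System.out.println(resolveModelPath(new Properties()));
    }
}
```

The same pattern works with Spring or Typesafe Config; only the lookup mechanism changes.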

Jörn

On Tue, Jul 11, 2017 at 3:45 AM,  <dr...@apache.org> wrote:
> +1 for releasing models
>
> as for the rest, I am not sure how I feel. Is there just one model for the
> Language Detector? I don't want this to become a versioning issue where
> langDect.bin version 1 goes with 1.8.1 but version 2 goes with 1.8.2. Can
> anyone download the Leipzig corpus? Being able to reproduce the model is
> very powerful, because if you have additional data you can add it to the
> Leipzig corpus to improve your model.
>
> I am not a big fan of default models, because it is frustrating as a user
> when unexpected things happen (like if you think you are telling it to use
> your model, but it uses the default). However, if the code is verbose
> enough, this is really not a valid concern. I would want to see the use
> case develop.
> Daniel
>
>
>> On Jul 10, 2017, at 8:58 PM, Aliaksandr Autayeu <aliaksa...@autayeu.com> 
>> wrote:
>>
>> Great idea!
>>
>> +1 for releasing models.
>>
>> +1 to publish models in jars on Maven Central. This is the fastest way to
>> get somebody started. Moreover, having an extensible mechanism for others
>> to do it on their own is really helpful. I did this with extJWNL for
>> packaging WordNet data files. It is also convenient for packaging one's own
>> custom dictionaries and providing them via repositories. It reuses existing
>> infrastructure for things like versioning and distribution. Model metadata
>> has to be thought through though. Oh, what a mouthful...
>>
>> +1 for separate download ("no dependency manager" cases)
>>
>> +1 to publish data/scripts/provenance. The more reproducible it is, the
>> better.
>>
>> +1 for some mechanism of loading models from classpath.
>>
>> ~ +1 to maybe explore classpath for a "default" model for API (code) use
>> cases. Perhaps similarly to Dictionary.getDefaultResourceInstance() from
>> extJWNL. But this has to be well thought through as design mistakes here
>> might release some demons from jar hell. I didn't face it, but I'm not sure
>> the extJWNL design is best as I didn't do much research on alternatives.
>> And I'd think twice before adding model jars to main binary distribution.
>>
>> +1 to store only the model-building code in the SCM repo. I would not
>> bloat the SCM with binaries. Maven repositories, though not ideal, are
>> better for this than SCM (and there are specialized tools like JFrog).
>>
>> ~ -1 about changing the CLI to use models from the classpath. There was no
>> proposal, but my understanding is that it would be some sort of
>> classpath:// URL - please correct or clarify. I'd like to see the proposal
>> and use cases where it is more convenient than the current way of just
>> pointing to the file. Perhaps it depends. Our models are already zips with
>> manifests. Jars are zips too. Perhaps we could change the model packaging
>> layout to make it more "jar-like", or augment it with metadata for finding
>> default models on the classpath for the above cases of distributing through
>> Maven repositories and loading from code, while leaving the CLI as is -
>> even if your model is technically on the classpath, in most cases you can
>> point to a jar in the file system, so the CLI can stay like it is now. It
>> seems that dealing with the classpath is more suitable (convenient, safer,
>> customary, ...) for developers fiddling with code than for users fiddling
>> with the command line.
>>
>> +1 for mirroring source corpora. The more reproducible things are, the
>> better. But costs (infrastructure) and licenses (this looks like
>> redistribution, which is not always allowed) might be an issue.
>>
>> I'd also propose to augment the model metadata with (optional) information
>> about source corpora, provenance, and as much reproduction information as
>> possible. Mostly for easier reproduction and provenance tracking. In my
>> experience I had challenges recalling what y-d-u-en.bin was trained on, on
>> which revision of that corpus, which part or subset, which language, and
>> whether it also had other annotations (and respective models) for
>> connecting all the possible models from that corpus (e.g.
>> sent-tok-pos-chunk-...).
>>
>> Aliaksandr
>>
>> On 10 July 2017 at 17:41, Jeff Zemerick <jzemer...@apache.org> wrote:
>>
>>> +1 to an opennlp-models jar on Maven Central that contains the models.
>>> +1 to having the models available for download separately (if easily
>>> possible) for users who know what they want.
>>> +1 to having the training data shared somewhere with scripts to generate
>>> the models. It will help protect against losing data as William mentioned.
>>> I don't think we should depend on others to reliably host the data. I'll
>>> volunteer to help script the model generation to run on a fleet of EC2
>>> instances if it helps.
>>>
>>> If the user does not provide a model to use on the CLI, can the CLI tools
>>> look on the classpath for a model whose name fits the needed model (like
>>> en-ner-person.bin) and if found use it automatically?
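Something like that lookup could be sketched as follows. This is a hypothetical helper, not existing OpenNLP CLI code; the classpath-first-then-file-system order and the error message are assumptions:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a classpath-first model lookup for the CLI.
// Try the classpath for a conventionally named model (e.g. a model jar
// containing en-ner-person.bin), then fall back to the file system.
public class ModelLocator {

    public static InputStream openModel(String name) throws IOException {
        InputStream in = Thread.currentThread()
                .getContextClassLoader()
                .getResourceAsStream(name);
        if (in != null) {
            return in; // found as a classpath resource, e.g. inside a model jar
        }
        File f = new File(name);
        if (f.isFile()) {
            return new FileInputStream(f); // fall back to a plain file path
        }
        throw new FileNotFoundException("No model named " + name
                + " on the classpath or in the file system");
    }
}
```

With this order, a model jar on the classpath would win over a same-named file, which is exactly the kind of surprise Daniel warned about, so the CLI would need to log which source it picked.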
>>>
>>> Jeff
>>>
>>>
>>>
>>> On Mon, Jul 10, 2017 at 5:06 PM, Chris Mattmann <mattm...@apache.org>
>>> wrote:
>>>
>>>> +1. In terms of releasing models, maybe an opennlp-models package, and
>>>> then using the Maven structure of src/main/resources/<package prefix dirs>/*.bin
>>>> for putting the models.
>>>>
>>>> Then using an assembly descriptor to compile the above into a *-bin.jar?
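A minimal assembly descriptor along those lines might look like the sketch below. The id, directory, and include pattern are placeholders, not an agreed layout:

```xml
<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0">
  <id>bin</id>
  <formats>
    <format>jar</format>
  </formats>
  <fileSets>
    <fileSet>
      <!-- Pick up the *.bin models placed under the Maven resources tree -->
      <directory>src/main/resources</directory>
      <outputDirectory>/</outputDirectory>
      <includes>
        <include>**/*.bin</include>
      </includes>
    </fileSet>
  </fileSets>
</assembly>
```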
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>>
>>>>
>>>>
>>>> On 7/10/17, 4:09 PM, "Joern Kottmann" <kottm...@gmail.com> wrote:
>>>>
>>>>    My opinion about this is that we should offer the model as maven
>>>>    dependency for users who just want to use it in their projects, and
>>>>    also offer models for download for people to quickly try out OpenNLP.
>>>>    If the models can be downloaded, a new user could very quickly test
>>>>    it via the command line.
>>>>
>>>>    I don't really have any thoughts yet on how we should organize it. It
>>>>    would probably be nice to have some place where we can share all the
>>>>    training data, and then have the scripts to produce the models checked
>>>>    in. It should be easy to retrain all the models in case we do a major
>>>>    release.
>>>>
>>>>    If a corpus vanishes, we should drop support for it; it must be
>>>>    obsolete then.
>>>>
>>>>    Jörn
>>>>
>>>>    On Mon, Jul 10, 2017 at 8:50 PM, William Colen <co...@apache.org>
>>>> wrote:
>>>>> We need to address things such as sharing the evaluation results and
>>>>> how to reproduce the training.
>>>>>
>>>>> There are several possibilities for that, but there are points to
>>>>> consider:
>>>>>
>>>>> Will we store the model itself in a SCM repository or only the code
>>>>> that can build it?
>>>>> Will we deploy the models to a Maven Central repository? It is good
>>>>> for people using the Java API but not for the command-line interface;
>>>>> should we change the CLI to handle models on the classpath?
>>>>> Should we keep a copy of the training corpus or always download it
>>>>> from the original provider? We can't guarantee that the corpus will be
>>>>> there forever, not only because its license changed, but simply because
>>>>> the provider is not keeping the server up anymore.
>>>>>
>>>>> William
>>>>>
>>>>>
>>>>>
>>>>> 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottm...@gmail.com>:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> since Apache OpenNLP 1.8.1 we have a new language detection component
>>>>>> which, like all our components, has to be trained. I think we should
>>>>>> release a pre-built model for it trained on the Leipzig corpus. This
>>>>>> will allow the majority of our users to get started very quickly with
>>>>>> language detection without the need to figure out how to train it.
>>>>>>
>>>>>> How should this project release models?
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>
