On 2/1/11 4:57 PM, Grant Ingersoll wrote:
Your timing is great, as I was just about to suggest the same thing.
On Feb 1, 2011, at 6:51 AM, Jörn Kottmann wrote:
Hi all,
I would like to go ahead and get our first release out. The release is
backward compatible with the models we had over at SourceForge,
which means we do not need to release new models right now.
The logic to train most of the models is already included in OpenNLP,
so our users can train the models themselves or even
mix in their own data.
To release the models at Apache we have to go through a series of legal
issues which I believe should not postpone our first release for
weeks or months.
Can you summarize the issues here? The last thread is mountainous. To some
extent, there is no time like the present to address the legal issues. The ASF
has legal counsel; if you can summarize what we do to make the models and what
the concerns are, we can take it over to legal-discuss@ and start working on
it. It may not be as big a deal as one might think.
The concern is that our models are trained on various closed or free
corpora, almost all of which have different licenses.
We would have to discuss, for each corpus, whether the model trained on it
may be distributed under AL 2.0.
I believe in most cases we do not violate any copyright, because
statistics about a text are not protected
by its copyright. We would for example generate bigram or trigram
features over the whole corpus.
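
To make "bigram or trigram features" concrete, here is a minimal Python sketch of n-gram counting. It is only an illustration, not OpenNLP's actual feature generation; the function name ngram_counts and the toy corpus are made up for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy corpus; real training data would be far larger.
corpus = "the cat sat on the mat and the cat slept".split()

bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)

# Show the most frequent bigrams with their counts.
for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

What comes out is a table of word sequences and their frequencies rather than the running text. Whether that distinction matters under a given corpus license is the question for the legal list.
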
In my opinion we at least need to provide a list of corpora and licenses
to start a discussion over at
the legal list, which alone will take some time to dig out, at least for
the English training data.
We also have training data where we are unsure about the license.
On the other hand, we gain nothing by doing it now as part of the
release.
In my opinion we should try to get the process started, maybe put
together a wiki page, and as soon
as we have all the information the legal people need, start
talking to them.
Jörn