There are legal grounds, in the US at least, for viewing models
derived from some corpora as *not* being constrained by the copyright
on the corpus. Since IANAL, I will not try to reproduce the reasoning
here.

This reasoning is already applied to the Bayesian models built by the
SpamAssassin project, which maintains a private corpus that includes
copyrighted materials.

A factor here is whether the data is collected from a publication
(e.g. the open web) or obtained under a specific license. Obviously,
in the second case, the license controls usage.

As discussed here before, this situation is still not ideal, as an
open source project would like the sources of its models (i.e. the
corpora) to be open. However, if you want a model that performs well
on CNN, or Jane Austen, you're not terribly likely to obtain one by
training it on WikiNews.

On Thu, Feb 2, 2012 at 4:38 AM, Riccardo Tasso <riccardo.ta...@gmail.com> wrote:
> For the user interface, it would be interesting using create.js:
> http://createjs.org/
>
> What about the backend? I've contributed to a similar tool for my company,
> I'm sorry but it can't be released, but I can say we used a mongoDB database
> in which each document is a sentence.
>
> Then I think it would be useful to tag sentences with as much information as
> possible, such as corpus/provenance, language, tagger (automatic or manual),
> reliability...
>
> Let's talk about it, I think it would be a very useful resource for
> everyone.
>
> Riccardo
>
>
> On 02/02/2012 08:37, Katrin Tomanek wrote:
>>
>> Hi James, hi Jason,
>>
>> thanks for clarifying the license situation!
>>
>> About your plans on iterative training data generation: that sounds
>> interesting, and we have probably been doing similar stuff to build corpora. So it
>> would really be interesting to share ideas or join forces.
>>
>> Best
>> Katrin
>>
>>
>>
>> On 02/02/2012 02:52 AM, James Kosin wrote:
>>>
>>> Katrin,
>>>
>>> Hmm... maybe I'll be writing a fact page to go on our web-site until we
>>> get this straightened out.
>>>
>>> 1)  The models at sourceforge are primarily used only for research.  No
>>> commercial usage.
>>> 2)  Most of the corpora are heavily copyrighted and exclude all
>>> commercial usage.  Mostly because they are fully copyrighted texts and
>>> are treated as most books are...
>>>
>>> 3)  Both these out of the way, our team is also attempting to put
>>> together a way for us to generate and get a free corpus based on other
>>> free sources.  Where the copyright is more of a free information
>>> exchange.  I think WikiNews has been looked at as well as other
>>> sources.  We have a sample server applet that will eventually run on a
>>> server to allow us to mark/tag/take apart the information and generate
>>> the correct format for the training data required for the namefinder,
>>> tokenizer and POS tagger.
>>>       Help on this is extremely welcome and I think you and anyone else
>>> interested can contact Jorn to get started or to find out how to help.
>>>
>>> James
>>>
>>> On 2/1/2012 6:42 AM, Katrin Tomanek wrote:
>>>>
>>>> Hi everybody,
>>>>
>>>> I am wondering what licence applies to the models provided for the
>>>> apache-opennlp tools (those that can be found at:
>>>> http://opennlp.sourceforge.net/models-1.5/).
>>>>
>>>> As an example: the models based on the tiger corpus -- are they also
>>>> subject to the apache licence? if not, what licence? Same question for
>>>> models based on conll data.
>>>>
>>>> So, as a company, can we use these models in a commercial context or
>>>> do we have to licence the original corpus additionally ?
>>>>
>>>> Best
>>>> Katrin
>>>>
>>>>
>>>>
>>>
>>
>>
>
