On 01/03/2013 01:33 AM, Benson Margulies wrote:
On Wed, Jan 2, 2013 at 2:30 PM, Tim Miller
<[email protected]> wrote:
>The license is Share Alike 3.0. The reason we need advice is that we are
>using a modified/derived version (the clause in the legal FAQ starts
>"Unmodified media..."). Specifically, we built a Lucene index with 5000
>Wikipedia articles relating to medicine. Each article is modified by
>reducing it to a list of words and their counts in that article. Is there
>some advice on whether this sort of modification is allowable or whether
>it disqualifies?
A language model derived from a corpus is not necessarily a derived
work of the corpus. Opinions vary. Some would tell you that it's a new
work entirely, and you own it. Others would tell you that you need a
specific license from the original content owners.
The answer probably also depends a lot on the legal system of the
country you are in. As far as I know, things are stricter in some European
countries since they do not have a fair use clause like in the US.
Media monitoring companies, for example, get away with using short
extracts (a couple of words or sentences) from news articles and selling
them to their customers as their own work.
Statistical models usually contain much shorter pieces of text, often just
bi- or tri-grams, and cannot be used to reconstruct longer pieces of text.
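To make that concrete, here is a rough Python sketch of the kind of
reduction being discussed. It is only an illustration: the tokenizer and
the function names (tokenize, word_counts, ngram_counts) are my own
assumptions, not what Lucene or any particular toolkit does internally.
It reduces a text to word counts and to n-gram counts, and shows that two
different sentences can produce identical word counts, so the original
word order is not stored in the counts.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_counts(text):
    """Reduce a text to a mapping of word -> number of occurrences."""
    return Counter(tokenize(text))


def ngram_counts(text, n):
    """Count every run of n consecutive tokens (n=2 bigrams, n=3 trigrams)."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "The dog bit the man"
b = "The man bit the dog"

# Different sentences, identical word counts: word order is discarded.
print(word_counts(a) == word_counts(b))

# Bigrams keep only local context (pairs of adjacent words).
print(ngram_counts(a, 2))
```

Whether such counts are legally a derived work is a separate question;
the sketch only shows what information the counts do and do not keep.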
Jörn