On 15.04.2015, at 10:23, Joern Kottmann <kottm...@gmail.com> wrote:

> With publicly accessible data I mean a corpus you can somehow acquire,
> opposed to the data you create on your own for a project.
>
> All the corpora we support in the formats package are publicly accessible.
> Maybe some you have to buy and for others you just have to sign some agreement.
>
> A very interesting corpus for testing (and training models on) is OntoNotes.
>
> Here is a link to the LDC entry:
> https://catalog.ldc.upenn.edu/LDC2011T03
>
> You can get it for free (or for a small distribution fee) but you can't
> just download it.
> It would be great if the ASF could acquire this data set so we can share it
> among the committers.
>
> Is that what you mean with proprietary data?

Yes, that is what I mean. For example, the TIGER corpus requires clicking through some pages and forms to reach a download page, but in principle the corpus appears to be downloadable through a deep-link URL. The license terms state that the corpus must not be redistributed.

Some tools are also publicly accessible and downloadable but not redistributable. For example, anybody can download TreeTagger and its models, but only from the original homepage. Redistribution is not permitted, i.e. you may not publish it to a repository or offer it on an alternative homepage.

So there is a (small) class of resources between redistributable and proprietary (for a fee): those that are in principle publicly accessible (for free) but not redistributable.

Cheers,

-- Richard