Or we just make a download script which bootstraps the user's corpus folder.

Could be a couple of wget lines or so ...
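
A minimal sketch of what such a bootstrap script might look like, written in Python instead of wget so it can also skip files that are already cached. The URL list, the folder name and the helper name `fetch_if_missing` are placeholders I made up for illustration; nothing here is an agreed layout.

```python
import os
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical placeholder; the real corpus URLs are not decided in this thread.
CORPUS_URLS = [
    "https://example.org/corpora/sample-corpus.tar.gz",
]


def fetch_if_missing(url, corpus_dir):
    """Download url into corpus_dir unless a cached copy already exists.

    Returns (path, downloaded), where downloaded is False on a cache hit.
    """
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)

    # Use the last path segment of the URL as the local file name.
    target = corpus_dir / Path(urlparse(url).path).name
    if target.exists():
        return target, False

    # Download to a temporary name first so an interrupted transfer
    # is never mistaken for a complete cached file.
    partial = target.with_name(target.name + ".part")
    urllib.request.urlretrieve(url, str(partial))
    os.replace(partial, target)
    return target, True


if __name__ == "__main__":
    # Hypothetical default location for the user's corpus folder.
    corpus_dir = Path.home() / "opennlp-corpora"
    for url in CORPUS_URLS:
        path, downloaded = fetch_if_missing(url, corpus_dir)
        print(("downloaded" if downloaded else "cached"), path)
```

Running it once before the tests would prepare the environment, and the tests themselves would never touch the network. Corpora that may not be redistributed would still be fetched from their original site by each user.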


Jörn

On Wed, Apr 29, 2015 at 6:17 AM, William Colen <william.co...@gmail.com>
wrote:

> Automating the download would be fine as long as we cache it, as Richard
> suggested. Maybe it could be done by a script to prepare the environment,
> and not be part of the unit test itself.
> Anyway, it would be a good idea to save the data somewhere because we never
> know if some of the websites will become unavailable in the future.
>
>
> 2015-04-15 5:31 GMT-03:00 Richard Eckart de Castilho <
> richard.eck...@gmail.com>:
>
> > On 15.04.2015, at 10:23, Joern Kottmann <kottm...@gmail.com> wrote:
> >
> > > By publicly accessible data I mean a corpus you can somehow acquire,
> > > as opposed to the data you create on your own for a project.
> > >
> > > All the corpora we support in the formats package are publicly
> > > accessible. Maybe some you have to buy and for others you just have
> > > to sign some agreement.
> > >
> > > A very interesting corpus for testing (and training models on) is
> > > OntoNotes.
> > >
> > > Here is a link to the LDC entry:
> > > https://catalog.ldc.upenn.edu/LDC2011T03
> > >
> > > You can get it for free (or for a small distribution fee) but you can't
> > > just download it.
> > > It would be great if the ASF could acquire this data set so we can
> > > share it among the committers.
> > >
> > > Is that what you mean by proprietary data?
> >
> > Yes, that is what I mean.
> >
> > E.g. the TIGER corpus requires clicking through some pages and forms to
> > reach a download page, but in principle, it appears as if the corpus was
> > simply downloadable by a deep-link URL. The license terms state that the
> > corpus must not be redistributed.
> >
> > Some tools are also publicly accessible and downloadable but not
> > redistributable. For example anybody can download TreeTagger and its
> > models, but only from the original homepage. It is not permitted to
> > redistribute it, i.e. to publish it to a repository or offer it on an
> > alternative homepage.
> >
> > So there is a (small) class of resources between being redistributable
> > and proprietary (for fee), namely being in principle publicly accessible
> > (for free) but not redistributable.
> >
> > Cheers,
> >
> > -- Richard
>
