Re: Where do we go from here? WAS: Turning off public access to the regression corpora?

sahy...@fileaffairs.de Thu, 16 Jan 2025 05:46:34 -0800

On Thursday, 16.01.2025 at 08:26 -0500, Tim Allison wrote:
> This is a really helpful delineation of the issues. Thank you, Maruan,
> for this and for all of your support with the server.
> 
> I'll open a ticket on LEGAL's Jira?



 Yes please.

> 
> On Wed, Jan 15, 2025 at 3:55 AM sahy...@fileaffairs.de <
> sahy...@fileaffairs.de> wrote:
> 
> > Hi Tim,
> > 
> > IMHO there are several parts to it.
> > 
> > a) serving content which might look like other corporations' sites can
> > be interpreted as phishing
> > b) scraping and storing copyrighted content
> > c) scraping and storing content containing personal data
> > 
> > a) is being dealt with in the current form. As long as we don't
> > publicly serve the files we are fine. We could also allow
> > password-protected https access if that has a benefit over ssh.
> > b) scraping copyrighted information is typically OK (there are legal
> > cases where this has been decided), although there might be cases
> > where we need to remove individual files.
> > c) scraping and storing personal data without permission is mostly
> > not OK under GDPR and other acts. This becomes very difficult to
> > handle. E.g. if one uploaded a file containing personal data to a bug
> > tracker, one could argue that by uploading it one gave permission to
> > use it within the context of the bug tracking and the dev process
> > behind it. That doesn't include permission to load the file from that
> > system and use it in a different context.
> > 
> > I think until c) is sorted out we cannot allow access in a wider
> > context, and we even need to reconsider whether we can use it at all,
> > despite it being very beneficial.
> > 
> > Maybe we can have a chat with legal about that.
> > 
> > BR
> > Maruan
> > 
> > 
> > 
> > 
> > On Tuesday, 14.01.2025 at 08:17 -0500, Tim Allison wrote:
> > > Hi Stefan,
> > > 
> > >   I'm sorry for this sudden change. I'm hoping that we can find a
> > > way to make this all work again, but there are complexities. Part of
> > > the challenge is that the liability is spread across several
> > > organizations and individuals; part of the challenge is everything
> > > to do with the varying global legal/privacy requirements around
> > > crawled data. And there are other challenges.
> > > 
> > >   These corpora have been critical to numerous parsing projects at
> > > the ASF and to devs and projects outside of the ASF. I've heard from
> > > a few others offline who are also affected by this.
> > > 
> > > 
> > > All,
> > >   What are our priorities? How can we move forward? Some options
> > > that I see:
> > > 
> > > 0) nuclear option: shut down the server entirely
> > > 1) continue as we have it now -- no http/s access
> > > 2) host reports/metadata only via https
> > > 3) host "packaged" corpora in zips (password protected?) via https
> > > 4) password protect https access to the corpora
> > > 5) not a viable option: turn everything back on
> > > 6) not a viable option: turn everything back on with a strict
> > >    robots.txt policy
> > > 
> > >   Any other options? What are our preferences?
> > > 
> > >           Best,
> > > 
> > >                 Tim
> > > 
> > > On Sat, Jan 11, 2025 at 9:01 AM stefan6419846
> > > <stefan6419...@gmail.com>
> > > wrote:
> > > 
> > > > We at pypdf (https://github.com/py-pdf/pypdf) have been hit by the
> > > > unexpected shutdown of the service and were glad to at least find
> > > > this indirect announcement. Nevertheless, it seems like we have to
> > > > find a suitable alternative for the previously used govdocs1 PDF
> > > > files from your server, as the official govdocs1 sources do not
> > > > expose the individual PDF files directly.
> > > > 
> > > > Thanks for hosting these files in the past.
> > > > 
> > > > Best regards,
> > > > Stefan
> > > > 
> > > > On 2025/01/09 01:36:59 Tim Allison wrote:
> > > > > All,
> > > > >  We've gotten a handful of takedown requests recently. I had
> > > > > initially envisioned public sharing of files as a key component
> > > > > of our server. We can still use the files and offer read access
> > > > > to fellow file researchers. I'm not sure I want to deal with
> > > > > further takedown requests.
> > > > >  As an intermediate step, we could ask robots not to crawl the
> > > > > data, but that's not reliable.
> > > > >  So, in lieu of that, with a heavy heart, I ask if it is time to
> > > > > close off public access?
> > > > >   WDYT?
> > > > > 
> > > > >           Best,
> > > > > 
> > > > >                     Tim
> > > > > 
> > > > 
> > 
> >
