Our Common Crawl slice plus bug trackers are currently 1 TB; govdocs1 is another
0.5 TB.

2 TB would safely cover the source documents that we're currently using.



On Tue, Jun 2, 2020 at 6:08 AM Maruan Sahyoun <[email protected]>
wrote:

> How many TB would that be?
>
> > Still haven’t had time to put the server in a DMZ. Ugh.
> >
> >  Yes, more than happy to share.
> >
> > If anyone has recommendations for file hosting for a couple of TB, let me
> > know.
> >
> > One option would be to work with CommonCrawl to bump the max file size one
> > crawl a year...
> >
> > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <[email protected]>
> > wrote:
> >
> > > Can we / I access these files? Most differences are improvements or not
> > > meaningful, but there are a few I'd like to have a look at, e.g.
> > >
> > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > >
> > > the word "antrag" loses the first "a". Although maybe the "a" was a big
> > > one and gets assigned to another line.
> > >
> > > Tilman
> > >
> > > On 02.06.2020 at 02:58, Tim Allison wrote:
> > > > Reports are available here:
> > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
> > > > Looks like there are trivial differences in content with a slight
> > > > improvement over 2.0.19.  I don't see any differences in exceptions or
> > > > attachments.
> > > >
> > > > Cheers,
> > > >
> > > >          Tim
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
> --
> Maruan Sahyoun
>
> FileAffairs GmbH
> Josef-Schappe-Straße 21
> 40882 Ratingen
>
> Tel: +49 (2102) 89497 88
> Fax: +49 (2102) 89497 91
> [email protected]
> www.fileaffairs.de
>
> Managing Director: Maruan Sahyoun
> Commercial register: AG Düsseldorf, HRB 53837
> VAT ID: DE248275827
>
>
