> Maruan,
>   To confirm, you're ok if we grant access to the server to our colleagues
> on Tika and POI?

To be clear, my company is only sponsoring the box. It's the project's decision
who needs access, not mine. So feel free.

BR
Maruan


>   Again, wow, THANK YOU!
> 
>                Best,
> 
>                           Tim
> 
> On Tue, Jun 2, 2020 at 3:57 PM Tim Allison <[email protected]> wrote:
> 
> > > proper domain for https access
> > 
> > I just pinged infra on slack.
> > 
> > If they're able to do it, what would we want?
> > 
> > file-corpora.apache.org
> > corpora.apache.org
> > corpora-pdfbox.apache.org
> > corpora-tika.apache.org
> > 
> > Something else?  I'm also happy to buy a domain if that won't work.  There
> > are a couple available that are close enough.
> > 
> > On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <[email protected]>
> > wrote:
> > 
> > > > AMD ryzen looks fantastic.  Others would be great as well.
> > > > 
> > > > If ubuntu is possible at all, that's what I've been working with most
> > > > recently.
> > > 
> > > > OK - will set it up with that distro
> > > 
> > > > Other than that, ssh access and sudo privileges would be all I'd need.
> > > > 
> > > > > Are you ok if we set up apache httpd to host files for the public or
> > > > > will this be a community only resource?
> > > 
> > > > It can be used for whatever we want, so if you consider public file
> > > > sharing useful, of course we can do that. It would be good to get a
> > > > proper domain for https access. Would that be something infra can do?
> > > 
> > > > > If this is corporate sponsored, please let me know how/if we should
> > > > > mention the sponsorship.
> > > 
> > > no need to mention it - happy to help.
> > > 
> > > > Again...wow.  Thank you!
> > > > 
> > > > Best,
> > > > 
> > > >       Tim
> > > > 
> > > > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <[email protected]>
> > > > wrote:
> > > > 
> > > > > Could fund either:
> > > > > 
> > > > > AMD Ryzen 5 3600
> > > > > 64 GB RAM
> > > > > 2x2TB
> > > > > 
> > > > > or
> > > > > 
> > > > > AMD Ryzen 7 3700X based Server
> > > > > 64 GB RAM
> > > > > 2x8TB
> > > > > 
> > > > > or
> > > > > Intel® Core™ i9-9900K
> > > > > 64 GB RAM
> > > > > 2x8TB
> > > > > 
> > > > > > All are root servers so one has to vote for taking care of them (I
> > > > > > can do the initial setup).
> > > > > 
> > > > > 
> > > > > 
> > > > > BR
> > > > > Maruan
> > > > > 
> > > > > > There are two use cases.
> > > > > > 
> > > > > > > 1) host shared data so that we can all point to and work from the
> > > > > > > same data, ideally both literal docs and also extracts (text/metadata
> > > > > > > .json files representing extracted information).
> > > > > > 
> > > > > > 2) a modest vm to allow all of us to run the regression tests
> > > > > > 
> > > > > > We could use help with either or both.
> > > > > > 
> > > > > > What we had before:
> > > > > > 8 GB RAM
> > > > > > 8 cores
> > > > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
> > > > > > 
> > > > > > We can always use more RAM and more cores up to the point of I/O
> > > > > > bottlenecks.
> > > > > > 
> > > > > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <[email protected]>
> > > > > > > wrote:
> > > > > > 
> > > > > > > > Is that a storage box only, or does it need to do some computing
> > > > > > > > too? Maybe you could write a small spec for the server requirements?
> > > > > > > 
> > > > > > > BR
> > > > > > > Maruan
> > > > > > > 
> > > > > > > 
> > > > > > > > Still haven’t had time to put the server in a dmz. Ugh.
> > > > > > > > 
> > > > > > > >  Yes, more than happy to share.
> > > > > > > > 
> > > > > > > > > If anyone has recommendations for file hosting for a couple of TB,
> > > > > > > > > let me know.
> > > > > > > > > 
> > > > > > > > > One option would be to work with CommonCrawl to bump the max file
> > > > > > > > > size one crawl a year...
> > > > > > > > 
> > > > > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > > > Can we / I access these files? Most differences are improvements
> > > > > > > > > > or not meaningful, but there are a few I'd like to have a look at, e.g.
> > > > > > > > > 
> > > > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
> > > > > > > > > 
> > > > > > > > > > the word "antrag" loses the first "a". Although maybe the "a" was
> > > > > > > > > > a big one and gets assigned to another line.
> > > > > > > > > 
> > > > > > > > > Tilman
> > > > > > > > > 
> > > > > > > > > > On 02.06.2020 at 02:58, Tim Allison wrote:
> > > > > > > > > > > > Reports are available here:
> > > > > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
> > > > > > > > > > > Looks like there are trivial differences in content with a slight
> > > > > > > > > > > improvement over 2.0.19.  I don't see any differences in exceptions
> > > > > > > > > > > or attachments.
> > > > > > > > > > 
> > > > > > > > > > Cheers,
> > > > > > > > > > 
> > > > > > > > > >          Tim
> > > > > > > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > > > For additional commands, e-mail: [email protected]
> > > > > > > > > 
> > > > > > > > > 
> > > > > --
> > > > > Maruan Sahyoun
> > > > > 
> > > > > FileAffairs GmbH
> > > > > Josef-Schappe-Straße 21
> > > > > 40882 Ratingen
> > > > > 
> > > > > Tel: +49 (2102) 89497 88
> > > > > Fax: +49 (2102) 89497 91
> > > > > [email protected]
> > > > > www.fileaffairs.de
> > > > > 
> > > > > Geschäftsführer: Maruan Sahyoun
> > > > > Handelsregister: AG Düsseldorf, HRB 53837
> > > > > UST.-ID: DE248275827
> > > > > 
> > > > > 
> > > --
> > > Maruan Sahyoun
> > > 
> > > FileAffairs GmbH
> > > Josef-Schappe-Straße 21
> > > 40882 Ratingen
> > > 
> > > Tel: +49 (2102) 89497 88
> > > Fax: +49 (2102) 89497 91
> > > [email protected]
> > > www.fileaffairs.de
> > > 
> > > Geschäftsführer: Maruan Sahyoun
> > > Handelsregister: AG Düsseldorf, HRB 53837
> > > UST.-ID: DE248275827
> > > 
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > > 
> > > 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
[email protected]
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827

