Maruan,
  To confirm, you're ok if we grant access to the server to our colleagues
on Tika and POI?
  Again, wow, THANK YOU!

               Best,

                          Tim

On Tue, Jun 2, 2020 at 3:57 PM Tim Allison <[email protected]> wrote:

> >proper domain for https access
>
> I just pinged infra on slack.
>
> If they're able to do it, what would we want?
>
> file-corpora.apache.org
> corpora.apache.org
> corpora-pdfbox.apache.org
> corpora-tika.apache.org
>
> Something else?  I'm also happy to buy a domain if that won't work.  There
> are a couple available that are close enough.
>
> On Tue, Jun 2, 2020 at 1:08 PM Maruan Sahyoun <[email protected]>
> wrote:
>
>>
>> > AMD Ryzen looks fantastic.  Others would be great as well.
>> >
>> > If Ubuntu is possible at all, that's what I've been working with most
>> > recently.
>>
>> OK - will set up with that distro
>>
>> >
>> > Other than that, ssh access and sudo privileges would be all I'd need.
>> >
>> > Are you ok if we set up Apache httpd to host files for the public or will
>> > this be a community only resource?
>>
>> it can be used for whatever we want it to - so if you consider public
>> file sharing useful of course we can do that. Would be good if we get a
>> proper domain for https access. Would that be something infra can do?
>>
>> >
>> > If this is corporate sponsored, please let me know how/if we should mention
>> > the sponsorship.
>>
>> no need to mention it - happy to help.
>>
>> >
>> > Again...wow.  Thank you!
>> >
>> > Best,
>> >
>> >       Tim
>> >
>> > On Tue, Jun 2, 2020 at 9:22 AM Maruan Sahyoun <[email protected]>
>> > wrote:
>> >
>> > > Could fund either:
>> > >
>> > > AMD Ryzen 5 3600
>> > > 64 GB RAM
>> > > 2x2TB
>> > >
>> > > or
>> > >
>> > > AMD Ryzen 7 3700X based Server
>> > > 64 GB RAM
>> > > 2x8TB
>> > >
>> > > or
>> > > Intel® Core™ i9-9900K
>> > > 64 GB RAM
>> > > 2x8TB
>> > >
>> > > All are root servers so one has to vote for taking care of them (I can do
>> > > the initial setup).
>> > >
>> > >
>> > >
>> > > BR
>> > > Maruan
>> > >
>> > > > There are two use cases.
>> > > >
>> > > > 1) host shared data so that we can all point to and work from the same
>> > > > data, ideally both literal docs and also extracts (text/metadata .json
>> > > > files representing extracted information).
>> > > >
>> > > > 2) a modest vm to allow all of us to run the regression tests
>> > > >
>> > > > We could use help with either or both.
>> > > >
>> > > > What we had before:
>> > > > 8 GB RAM
>> > > > 8 cores
>> > > > 4 TB -- 2TB for docs, 1TB for extracts, 1TB for staging
>> > > >
>> > > > We can always use more RAM and more cores up to the point of I/O
>> > > > bottlenecks.
>> > > >
>> > > > On Tue, Jun 2, 2020 at 6:37 AM Maruan Sahyoun <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > is that a storage box only or does it need to do some computing too?
>> > > > >
>> > > > > Maybe you could write a small spec for the server requirement?
>> > > > >
>> > > > > BR
>> > > > > Maruan
>> > > > >
>> > > > >
>> > > > > > Still haven’t had time to put the server in a dmz. Ugh.
>> > > > > >
>> > > > > >  Yes, more than happy to share.
>> > > > > >
>> > > > > > If anyone has recommendations for file hosting for a couple of TB, let me
>> > > > > > know.
>> > > > > >
>> > > > > > One option would be to work with CommonCrawl to bump the max file size one
>> > > > > > crawl a year...
>> > > > > >
>> > > > > > On Tue, Jun 2, 2020 at 1:48 AM Tilman Hausherr <[email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Can we / I access these files? Most differences are improvements or not
>> > > > > > > meaningful, but there are a few I'd like to have a look at, e.g.
>> > > > > > >
>> > > > > > > commoncrawl3/commoncrawl3/XO/XOAAGISRMRPZQRZF4LSMJERGEYK5QI2T
>> > > > > > >
>> > > > > > > the word "antrag" loses the first "a". Although maybe the "a" was a big
>> > > > > > > one and gets assigned to another line.
>> > > > > > >
>> > > > > > > Tilman
>> > > > > > >
>> > > > > > > On 02.06.2020 at 02:58, Tim Allison wrote:
>> > > > > > > > > > Reports are available here:
>> > > > > > > > > > https://github.com/tballison/share/blob/master/tika_comparisons/reports-pdfbox-2.0.20.tgz
>> > > > > > > > Looks like there are trivial differences in content with a slight
>> > > > > > > > improvement over 2.0.19.  I don't see any differences in exceptions or
>> > > > > > > > attachments.
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > >
>> > > > > > > >          Tim
>> > > > > > > >
>> > > ---------------------------------------------------------------------
>> > > > > > > To unsubscribe, e-mail: [email protected]
>> > > > > > > For additional commands, e-mail: [email protected]
>> > > > > > >
>> > > > > > >
>> > > --
>> > > Maruan Sahyoun
>> > >
>> > > FileAffairs GmbH
>> > > Josef-Schappe-Straße 21
>> > > 40882 Ratingen
>> > >
>> > > Tel: +49 (2102) 89497 88
>> > > Fax: +49 (2102) 89497 91
>> > > [email protected]
>> > > www.fileaffairs.de
>> > >
>> > > Geschäftsführer: Maruan Sahyoun
>> > > Handelsregister: AG Düsseldorf, HRB 53837
>> > > UST.-ID: DE248275827
>> > >
>> > >
