Guys, I can help with the Tika dockerization. Let's just design/plan what we are going to do.
On Thu, Jun 1, 2017 at 4:02 PM, Eric Pugh <[email protected]> wrote:

> As the Tika project starts embracing more non-Java tools (I’m thinking of
> Tesseract, for example), dockerizing your Tika setup becomes more and more
> valuable.
>
> For example, I run the tests for my application on my local Mac, as well as
> on CircleCI. I have a dockerized Tika service that does the OCR stuff,
> and I know it works the same on both. It’s less exciting if I’m in an
> “all Java” world.
>
> > On Jun 1, 2017, at 7:55 AM, Allison, Timothy B. <[email protected]> wrote:
> >
> > Thank you, Thejan!
> >
> > -----Original Message-----
> > From: Thejan Wijesinghe [mailto:[email protected]]
> > Sent: Wednesday, May 31, 2017 5:40 PM
> > To: [email protected]
> > Subject: Re: experiences with Tika in Docker
> >
> > Hi Tim,
> >
> > I've used tika-server in Docker, but as a single instance only. Yes,
> > Docker's ability to limit a container's memory and CPU usage on the host
> > machine is great. It gives us a lot of flexibility: we could enforce
> > hard/soft memory limits, and we could even control the container's share
> > of the host machine's CPU cycles. Yes, it also limits the risk from
> > arbitrary code execution and XXE vulnerabilities. I already asked
> > Prof. Chris Mattmann about officially moving to Docker Hub. He said I
> > need to email Apache Infra to ask about this. Unfortunately, I still
> > haven't found the time to write that email.
> >
> > We already have multiple dockerfiles in Tika: the dockerfile in
> > tika-server, InceptionRestDockerfile, InceptionVideoRestDockerfile, and
> > Im2txtRestDockerfile (PR #180, for image captioning).
> >
> > Part of my GSoC project is to unify the existing REST services, such as
> > object recognition and image captioning. My idea is to bring all of those
> > REST services together so that the user can start/terminate any REST
> > service and see its statistics through a web-based GUI. I'm expecting to
> > use a combination of nginx (as the reverse proxy server) and Docker to
> > make it work. So obviously we will see Docker much more often in Tika.
> >
> > +1 for your idea of looking into hardening tika-server with the help of
> > Docker.
> >
> > best,
> > ThejanW
> >
> > On Thu, Jun 1, 2017 at 1:03 AM, Allison, Timothy B. <[email protected]>
> > wrote:
> >
> >> Dave Meikle, Tom and All,
> >>
> >> How many of us are using Tika in Docker? If so, how exactly are
> >> you using it? Single instance, swarm, Kubernetes, something else?
> >> People fear the I/O hit with tika-server...what are your experiences?
> >> I really like the ability to limit the number of CPUs in the Docker
> >> container. If a single doc causes multithreaded gc to go nuts, that
> >> won't kill an entire machine. This also cleanly limits the risk from
> >> XXE or arbitrary code execution, right?
> >>
> >> If this is one of the ways of the future for big data, we might want
> >> to look into hardening tika-server (OOMs, timeouts). What do you all
> >> think?
> >>
> >> Cheers,
> >>
> >> Tim
> >>
> >> Timothy B. Allison, Ph.D.
> >> Principal Artificial Intelligence Engineer
> >> Group Lead K83E/Human Language Technology
> >> The MITRE Corporation
> >> 7515 Colshire Drive, McLean, VA 22102
> >> 703-983-2473 (phone); 703-983-1379 (fax)
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.
