As the Tika project starts embracing more non-Java tools (I’m thinking of Tesseract, for example), dockerizing your Tika setup becomes more and more valuable.
For example, I run the tests for my application on my local Mac, as well as on CircleCI. I have a dockerized Tika service that does the OCR stuff, and I know it does the same work on both. It’s less exciting if I’m in an “all Java” world.

> On Jun 1, 2017, at 7:55 AM, Allison, Timothy B. <[email protected]> wrote:
>
> Thank you, Thejan!
>
> -----Original Message-----
> From: Thejan Wijesinghe [mailto:[email protected]]
> Sent: Wednesday, May 31, 2017 5:40 PM
> To: [email protected]
> Subject: Re: experiences with Tika in Docker
>
> Hi Tim,
>
> I've used Tika -server in docker but as a single instance only. Yes, its
> ability to limit container's resources with related to memory & CPU in the
> host machine is great, it gives us so much flexibility, we could enforce
> hard/soft memory limits, we could even manipulate the host machine's CPU
> cycles. Yes, it also limits risks of executing arbitrary code & XXE
> vulnerabilities. I already asked Prof. Chris Mattmann about officially moving
> to dockerhub. He said I need to make a mail to apache infra asking about
> this. Unfortunately, I still couldn't find a time to make that mail.
>
> We already have multiple dockerfiles in Tika, , dockerfile in tika-server,
> InceptionRestDockerfile, InceptionVideoRestDockerfile,
> Im2txtRestDockerfile(PR #180-for image captioning).
>
> Part of my GSoC project is to unify the existing REST services such as object
> recognition, image captioning. My idea is to unify all of those REST services
> where the user can start/terminate, see statistics of any REST service
> through a web based GUI. I'm expecting to use a fusion of nginx(as the
> reverse proxy server) & docker to make it work. So obviously we will see
> docker much often in Tika.
>
> +1 for your thought to looking into hardening the tika-server with the
> help of docker.
>
> best,
> ThejanW
>
> On Thu, Jun 1, 2017 at 1:03 AM, Allison, Timothy B. <[email protected]> wrote:
>
>> Dave Meikle, Tom and All,
>>
>> How many of us are using Tika in Docker? If so, how exactly are
>> you using it? Single instance, swarm, Kubernetes, something else?
>> People fear I/O hit with tika-server...what are your experiences?
>> I really like the ability to limit the number of CPUs in the Docker
>> container. If a single doc causes multithreaded gc to go nuts, that
>> won't kill an entire machine. This also cleanly limits the risk from
>> XXE or arbitrary code execution, right?
>>
>> If this is one of the ways of the future for big data, we might want
>> to look into hardening tika-server (OOMs, timeouts). What do you all think?
>>
>> Cheers,
>>
>> Tim
>>
>> Timothy B. Allison, Ph.D.
>> Principal Artificial Intelligence Engineer Group Lead K83E/Human
>> Language Technology The MITRE Corporation
>> 7515 Colshire Drive, McLean, VA 22102
>> 703-983-2473 (phone); 703-983-1379 (fax)
>>
>>

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
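
For readers following the thread, the CPU cap and the hard/soft memory limits discussed above might look roughly like the sketch below. It only builds the argument list for a `docker run` call; it does not start a container. The image name `example/tika-server` and the function `tika_docker_command` are hypothetical placeholders, not anything named in the thread, and the sketch assumes the image serves tika-server on its default port 9998. The flags themselves (`--cpus`, `--memory`, `--memory-reservation`, `-p`) are standard `docker run` options.

```python
import shlex


def tika_docker_command(
    image="example/tika-server",  # hypothetical image name, substitute your own
    cpus=2.0,                     # cap on CPU time, in units of CPUs
    memory="2g",                  # hard memory limit
    memory_reservation="1g",      # soft memory limit
    host_port=9998,               # host port mapped to the container
):
    """Return the argv list for a resource-limited `docker run` invocation."""
    return [
        "docker", "run", "--rm",
        "--cpus", str(cpus),
        "--memory", memory,
        "--memory-reservation", memory_reservation,
        # Assumes the container listens on tika-server's default port 9998.
        "-p", f"{host_port}:9998",
        image,
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line that could be pasted into a terminal.
    print(shlex.join(tika_docker_command()))
```

With limits like these, a single document that sends garbage collection into overdrive can exhaust only the container's share of the machine, which is the behaviour Tim describes wanting.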
