As the Tika project starts embracing more non-Java tools (I'm thinking of 
Tesseract, for example), dockerizing your Tika setup becomes more and more 
valuable.

For example, I run the tests for my application on my local Mac as well as on 
CircleCI. I have a dockerized Tika service that does the OCR stuff, and I 
know it does the same work on both. It's less exciting if I'm in an "all Java" 
world.
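
A rough sketch of that kind of setup, in Python. It is not the actual 
configuration from this thread: the image name, the host port mapping, and the 
memory/CPU values are all assumptions chosen for illustration. It builds a 
`docker run` command with resource limits and sends a file to tika-server's 
`/tika` endpoint, so the same call can be used locally and in CI.

```python
import urllib.request

# Assumptions for illustration only: image name and limits are not from this thread.
TIKA_IMAGE = "apache/tika"
TIKA_PORT = 9998  # tika-server's default listening port


def docker_run_command(image=TIKA_IMAGE, host_port=TIKA_PORT, memory="1g", cpus="2"):
    """Build a `docker run` command that caps the container's memory and CPUs."""
    return [
        "docker", "run", "--rm", "-d",
        "--memory", memory,                 # hard memory limit for the container
        "--cpus", cpus,                     # number of CPUs the container may use
        "-p", f"{host_port}:{TIKA_PORT}",   # publish tika-server on the host
        image,
    ]


def extract_text(path, base_url=f"http://localhost:{TIKA_PORT}"):
    """PUT a file to tika-server's /tika endpoint and return the extracted text."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        f"{base_url}/tika",
        data=body,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Passing the list from `docker_run_command()` to `subprocess.run(..., check=True)` 
would start the container; a test can then call `extract_text()` on a hypothetical 
sample file the same way on a laptop and on the CI host.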

 
> On Jun 1, 2017, at 7:55 AM, Allison, Timothy B. <[email protected]> wrote:
> 
> Thank you, Thejan!
> 
> -----Original Message-----
> From: Thejan Wijesinghe [mailto:[email protected]] 
> Sent: Wednesday, May 31, 2017 5:40 PM
> To: [email protected]
> Subject: Re: experiences with Tika in Docker
> 
> Hi Tim,
> 
> I've used tika-server in Docker, but only as a single instance. Yes, the 
> ability to limit a container's memory and CPU use on the host machine is 
> great; it gives us a lot of flexibility. We can enforce hard/soft memory 
> limits, and we can even control the container's share of the host machine's 
> CPU cycles. Yes, it also limits the risk from arbitrary code execution and 
> XXE vulnerabilities. I already asked Prof. Chris Mattmann about officially 
> moving to Docker Hub. He said I need to send an email to Apache Infra asking 
> about this. Unfortunately, I still haven't found the time to write that email.
> 
> We already have multiple Dockerfiles in Tika: the Dockerfile in tika-server, 
> InceptionRestDockerfile, InceptionVideoRestDockerfile, and 
> Im2txtRestDockerfile (PR #180, for image captioning).
> 
> Part of my GSoC project is to unify the existing REST services, such as object 
> recognition and image captioning. My idea is to bring all of those REST 
> services together so that the user can start or terminate any of them, and 
> see its statistics, through a web-based GUI. I'm expecting to use a 
> combination of nginx (as the reverse proxy server) and Docker to make it 
> work. So obviously we will see Docker much more often in Tika.
> 
> +1 for your thought of looking into hardening tika-server with the help 
> of Docker.
> 
> best,
> ThejanW
> 
> On Thu, Jun 1, 2017 at 1:03 AM, Allison, Timothy B. <[email protected]>
> wrote:
> 
>> Dave Meikle, Tom and All,
>> 
>>    How many of us are using Tika in Docker?  If so, how exactly are 
>> you using it?  Single instance, swarm, Kubernetes, something else?  
>> People fear I/O hit with tika-server...what are your experiences?
>> I really like the ability to limit the number of CPUs in the Docker 
>> container.  If a single doc causes multithreaded gc to go nuts, that 
>> won't kill an entire machine.  This also cleanly limits the risk from 
>> XXE or arbitrary code execution, right?
>> 
>> If this is one of the ways of the future for big data, we might want 
>> to look into hardening tika-server (OOMs, timeouts).  What do you all think?
>> 
>>        Cheers,
>> 
>>                Tim
>> 
>> Timothy B. Allison, Ph.D.
>> Principal Artificial Intelligence Engineer Group Lead K83E/Human 
>> Language Technology The MITRE Corporation
>> 7515 Colshire Drive, McLean, VA  22102
>> 703-983-2473 (phone); 703-983-1379 (fax)
>> 
>> 


_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.
