I’ve created this JIRA to track this work: 
https://issues.apache.org/jira/browse/TIKA-3010 

A WIP PR is at https://github.com/apache/tika/pull/305

My thought is to put something together that mimics how we deploy Solr, and see 
how that works.   I need an install process that a general IT person can follow, 
someone who isn’t a Tika expert or a Docker user.
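
Once such a service is installed and running, the client side is just plain HTTP. As a rough illustration only, here is a minimal Python sketch of sending a document to an already-running Tika Server. It assumes the server is on localhost at its default port 9998 and uses a PUT to the /tika endpoint; the function names are hypothetical and none of this is part of the PR above.

```python
import urllib.request

# Assumption: Tika Server is running locally on its default port (9998).
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for the plain text of a document."""
    return urllib.request.Request(
        base_url,
        data=data,
        method="PUT",
        headers={
            # Ask for extracted text rather than the default XHTML output.
            "Accept": "text/plain",
            # Generic type, so the server's own detection decides the format.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send the file at `path` to Tika Server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_extract_request(f.read(), base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With a server up, `extract_text("report.pdf")` (a placeholder file name) would return the text of that file. Nothing Tika-specific needs to be installed on the calling side, which is the appeal for an organization that only wants to run the service.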




> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <[email protected]> wrote:
> 
> Thanks for bringing this conversation up Eric.
> 
> 
> 
> Historically, if you look over the last 5 years, I think what you are asking 
> below has sort of already become the de facto truth. Most people are in fact 
> using Tika Server, whether they are individual devs, govvies, commercial folk 
> and the like. 
> 
> Big, small and medium projects. Evidenced by the expansion of Tika APIs into 
> pretty much every PL I know of and actively use today.
> 
> 
> 
> Given that, we probably should update the main website docs to make this more 
> prominent. The Tika Server docs on the wiki are pretty darn good, but they 
> don’t get prime real estate. It would be wonderful if someone wanted to update 
> the website to give them that.
> 
> 
> 
> The downstream Tika Python lib that I maintain has tons of activity, is used 
> by more than 350 projects, and relies solely on Tika-Server. My 
> recommendation to the Solr folks (having created 7633) from the 2014 DARPA 
> MEMEX days was to move towards a Tika Server based SolrCell dep, and that’s 
> the right way to go IMO.
> 
> 
> 
> Chris
> 
> From: Eric Pugh <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, December 4, 2019 at 12:24 PM
> To: "[email protected]" <[email protected]>
> Subject: [EXTERNAL] Do we have a community supported approach for deploying 
> Tika Server in production?
> 
> 
> 
> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
> 
> 
> 
> Over in Solr land there has been renewed discussion about streamlining what 
> Solr is....   
> 
> 
> 
> With regard to rich content extraction and the Tika project, it seems like the 
> two ideas that continue to preserve the existing behavior are:
> 
> 
> 
> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr.  
> This slims down the standard Solr download, and *might* make it easier to 
> update the version of Tika + dependent jars used?
> 
> 
> 
> 2) The second approach is to instead require Tika-Server to be running 
> (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate 
> the call to Tika-Server.
> 
> 
> 
> 
> 
> I was thinking about why I like option 1 better than 2, and I think it boils 
> down to how mature the IT organization I am working with is.  Some IT 
> organizations have large dev-ops teams and work at major scale, so managing 
> a fleet of Tika-Server instances on Kubernetes, with a load balancer 
> dynamically scaling them up and down, is simple and second nature!  However, 
> many organizations aren’t like that.
> 
> 
> 
> So I guess what I’m asking is: do we have a reasonable, supported approach for 
> deploying Tika Server for non-Tika-savvy organizations?   I’m thinking about 
> Solr, and specifically the fact that Solr has a well-defined set of Service 
> Installation scripts.   When I follow the directions in 
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
> I can feel confident that when the server is rebooted, Solr will come 
> back up!   Plus there is log rotation and all the rest.
> 
> 
> 
> In contrast, when I look at the Tika website, specifically the 
> https://tika.apache.org/1.22/gettingstarted.html page, the message is to run 
> Tika as a command line application, or embedded in your application.   
> 
> 
> 
> I’m wondering if Tika-Server needs to be made more prominent, and treated as 
> the “primary method of interacting with Tika”?   Do we need as a community to 
> focus more on Tika-Server?   In our getting started documentation, in our 
> usage documentation, and in our examples?
> 
> 
> 
> Do we need to create the equivalent of the Service Installation scripts for 
> Tika-Server?   
> 
> 
> 
> Wanted to stoke the discussion!
> 
> 
> 
> Eric
> 
> 
> 
> _______________________
> 
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
> http://www.opensourceconnections.com | My Free/Busy 
> <http://tinyurl.com/eric-cal>
> 
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> 
> 
> This e-mail and all contents, including attachments, is considered to be 
> Company Confidential unless explicitly stated otherwise, regardless of 
> whether attachments are marked as such.

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
    
