I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010
And a WIP PR is at https://github.com/apache/tika/pull/305

My thought is to put something together that mimics how we deploy Solr, and see how that works. I have a need for an install process that a general IT person can follow, someone who isn’t a Tika expert or a Docker user.

> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <[email protected]> wrote:
>
> Thanks for bringing this conversation up, Eric.
>
> Historically, if you look over the last 5 years, I think what you are
> asking below has sort of already become the de facto truth. Most people
> are in fact using Tika server, whether they are individual devs, govvies,
> commercial folk and the like. Big, small and medium projects. Evidenced
> by the expansion of Tika APIs into pretty much every PL I know of and
> actively use today.
>
> Given that, we probably should update the main website docs to make this
> more prominent. The tika server docs on the wiki are pretty darn good.
> But they don’t get prime real estate. Would be wonderful if someone wants
> to update the website to make it more prominent.
>
> The downstream Tika Python lib that I maintain has tons of activity, is
> used by 350+ projects, and relies solely on Tika-Server. My
> recommendation to the Solr folks (having created SOLR-7633) from the 2014
> DARPA MEMEX days was to move towards a Tika Server based SolrCell dep,
> and that’s the right way to go IMO.
>
> Chris
>
> From: Eric Pugh <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, December 4, 2019 at 12:24 PM
> To: "[email protected]" <[email protected]>
> Subject: [EXTERNAL] Do we have a community supported approach for
> deploying Tika Server in production?
>
> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
>
> Over in Solr land there has been renewed discussion about streamlining
> what Solr is....
>
> In regards to rich content extraction and the Tika project, it seems like
> the two ideas that continue to preserve the existing behavior are:
>
> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for
> Solr. This slims down the standard Solr download, and *might* make it
> easier to update the version of Tika + dependent jars used?
>
> 2) The second approach is to instead require Tika-Server to be running
> (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr
> delegate the call to Tika-Server.
>
> I was thinking about why I like option 1 better than 2, and I think it
> boils down to how mature the IT organization I am working with is. Some
> IT organizations have large dev-ops teams, and are working at major
> scale, and managing a fleet of Tika-Server on Kubernetes with Load
> Balancer dynamically scaling up and down is simple and second nature!
> However, many organizations aren’t like that.
>
> So I guess what I’m asking is: do we have a reasonable, supported approach
> for deploying Tika Server for non-Tika-savvy organizations? I’m thinking
> about Solr, and specifically the fact that Solr has a well-defined set of
> Service Installation scripts. When I follow the directions in
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
> I can feel confident that when the server is rebooted, Solr will come
> back up! Plus there is log rotation and all the rest.
>
> In contrast, when I look at the Tika website, specifically the
> https://tika.apache.org/1.22/gettingstarted.html page, the message is to
> run Tika as a command line application, or embedded in your application.
>
> I’m wondering if Tika-Server needs to be made more prominent, and treated
> as the “primary method of interacting with Tika”? Do we need as a
> community to focus more on Tika-Server?
> In our getting started documentation, in our usage documentation, and in
> our examples?
>
> Do we need to create the equivalent of the Service Installation scripts
> for Tika-Server?
>
> Wanted to stoke the discussion!
>
> Eric
>
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
> <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless of
> whether attachments are marked as such.

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
