Hi folks! I’ve got a mostly complete PR that adds install scripts for Tika Server, and I’m hoping a committer will take a look and give feedback (and ideally commit it in time for 1.24!).
A couple of things:

1) This was completely influenced by https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script (in fact, I started with the Solr scripts).

2) I’ve deleted all the Solr-specific aspects (I think); however, there may still be more to delete.

3) This requires a change to how we release Tika. Previously we shipped tika-app.jar, tika-eval.jar, and tika-server.jar; now, I think, we want to add the tika-server-bin.tgz and tika-server-bin.zip binary distributions.

I’m happy to start writing accompanying “how to deploy Tika Server” docs if this PR looks good! Or, please give input and I’ll make the updates.

Eric

> On Dec 12, 2019, at 2:39 PM, Eric Pugh <[email protected]> wrote:
>
> I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010
>
> And a WIP PR is at https://github.com/apache/tika/pull/305
>
> My thought is to put something together that mimics how we deploy Solr, and see how that works. I have a need for an install process that a general IT person can follow, who isn’t a Tika expert or a Docker user.
>
>> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <[email protected]> wrote:
>>
>> Thanks for bringing this conversation up Eric.
>>
>> Historically if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika server, whether they are individual devs, govvies, commercial folk and the like.
>>
>> Big, small and medium projects. Evidenced by the expansion of Tika APIs into pretty much every PL I know of and use actively today.
>>
>> Given that, we probably should update the main website docs to make this more prominent. The Tika server docs on the wiki are pretty darn good. But they don’t get prime real estate. Would be wonderful if someone wants to update the website to make it more prominent.
>>
>> The downstream Tika Python lib that I maintain has tons of activity, is used by more than 350 projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633) from the 2014 DARPA MEMEX days was to move towards a Tika Server based SolrCell dep, and that’s the right way to go IMO.
>>
>> Chris
>>
>> From: Eric Pugh <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, December 4, 2019 at 12:24 PM
>> To: "[email protected]" <[email protected]>
>> Subject: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
>>
>> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
>>
>> Over in Solr land there has been renewed discussion about streamlining what Solr is....
>>
>> In regards to rich content extraction and the Tika project, it seems like the two ideas that continue to preserve the existing behavior are:
>>
>> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr. This slims down the standard Solr download, and *might* make it easier to update the version of Tika + dependent jars used?
>>
>> 2) The second approach is to instead require Tika-Server to be running (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate the call to Tika-Server.
>>
>> I was thinking about why I like option 1 better than 2, and I think it boils down to how mature the IT organization I am working with is. Some IT organizations have large dev-ops teams and are working at major scale, and managing a fleet of Tika-Server instances on Kubernetes, with a load balancer dynamically scaling up and down, is simple and second nature! However, many organizations aren’t like that.
>>
>> So I guess what I’m asking is: do we have a reasonable supported approach for deploying Tika Server for non-Tika-savvy organizations? I’m thinking about Solr, and specifically the fact that Solr has a well-defined set of Service Installation scripts. When I follow the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production I can feel confident that when the server is rebooted, Solr will come back up! Plus there is log rotation and all the rest.
>>
>> In contrast, when I look at the Tika website, specifically the https://tika.apache.org/1.22/gettingstarted.htm page, the message is to run Tika as a command line application, or embedded in your application.
>>
>> I’m wondering if Tika-Server needs to be made more prominent, and treated as the “primary method of interacting with Tika”? Do we need as a community to focus more on Tika-Server? In our getting started documentation, in our usage documentation, and in our examples?
>>
>> Do we need to create the equivalent of the Service Installation scripts for Tika-Server?
>>
>> Wanted to stoke the discussion!
>>
>> Eric
>>
>> _______________________
>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>>
>> This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
