Apologies for being late to the show, and thank you Eric for pinging me on this.

I'm 100% for moving Tika out of Solr's JVM.  I see three options for doing
that, each of which should make things easier for users and keep Tika's jar
hell all to itself.

1) As already proposed, use Tika server and somehow figure out how to integrate 
that seamlessly.

2) Use Tika pipes within Solr directly or within a package (as Eric suggests).
This forks a process for parsing, and all the heavy dependencies go into the
forked process.  Solr would need tika-core, but could specify a directory with
tika-app.jar in it.  The dependency nightmare in tika-app.jar would not get
loaded into Solr's JVM.  We'd probably have to modify tika-pipes for this to
work roughly the way Tika is being used now, but I think something like this
is doable...

3) Point users at tika-pipes directly.  We have a Solr emitter.  Users can aim
tika-pipes at a directory of files, an S3 bucket, a GCS bucket, etc., and Tika
will safely parse the files in a forked process and forward the results to
Solr.  This is not as easy as curling bytes to Solr and having those bytes
parsed, but it is possible.
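
To make option 1 a bit more concrete, here is a rough client-side sketch in
Python (standard library only) of parsing in a separate Tika Server JVM and
then indexing the result. It is only an illustration, not a proposal for the
actual integration. Assumptions: Tika Server is listening on its default port
9998 and we use its /rmeta/text endpoint, Solr has a collection named "docs"
(hypothetical), and the field names content_txt, content_type_s and title_txt
are made-up dynamic fields. All helper names are mine.

```python
import json
import urllib.request

# Assumed endpoints: a local Tika Server and a hypothetical Solr collection "docs".
TIKA_RMETA_URL = "http://localhost:9998/rmeta/text"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def parse_with_tika_server(data, url=TIKA_RMETA_URL, timeout=60.0):
    """PUT raw file bytes to Tika Server and return its list of metadata dicts.

    If the Tika JVM crashes or hangs, this call raises or times out;
    the Solr JVM is not affected.
    """
    req = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def first_value(value):
    """Tika metadata values can be a single string or a list of strings."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def rmeta_to_solr_doc(doc_id, rmeta):
    """Flatten /rmeta output (container first, then embedded docs) into one doc."""
    container = rmeta[0] if rmeta else {}
    texts = [m.get("X-TIKA:content", "") for m in rmeta]
    doc = {
        "id": doc_id,
        # Hypothetical field: concatenated text of the container and attachments.
        "content_txt": "\n".join(t.strip() for t in texts if t and t.strip()),
    }
    content_type = first_value(container.get("Content-Type"))
    if content_type:
        doc["content_type_s"] = content_type
    title = first_value(container.get("dc:title"))
    if title:
        doc["title_txt"] = title
    return doc


def post_to_solr(docs, url=SOLR_UPDATE_URL, timeout=30.0):
    """POST a JSON array of documents to Solr's update handler; return HTTP status."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

With both servers running, the glue would be roughly
post_to_solr([rmeta_to_solr_doc("report-1", parse_with_tika_server(data))]),
where data holds the file's bytes. The point is only that a bad PDF takes down
(or times out) the Tika side and never Solr.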

Please let me know how I can help.

Best,

    Tim

On 2023/03/10 03:57:45 Gus Heck wrote:
> I totally think that for any heavy-duty use case, or any use case
> where the documents are not constrained to a known set with polite
> characteristics (i.e. known not to be password protected, reasonable
> length, etc), Tika should not run inside Solr. That said, as I see it the
> key downside of not having solr-cell as part of Solr would be that we would
> likely remove the docs for it too, and the entire concept of how to get a
> "normal" document into Solr evaporates from our ref guide. So I like the
> sound of it being an official package as Eric suggests, and perhaps even
> the canonical example of how to install a package... Along with heavy
> documentation caveats of why Tika should run outside of solr for most
> production purposes of course.
> 
> -Gus
> 
> 
> On Thu, Mar 9, 2023 at 8:09 AM Eric Pugh <ep...@opensourceconnections.com>
> wrote:
> 
> > I did a series of blog posts about Tika, and while conventional wisdom is
> > that running Tika in Solr is bad, I’ve had GREAT luck with it over the
> > years.
> > https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
> >
> > Having said that, my bigger beef with Tika in Solr is about all the
> > dependencies that it drags along.   I am constantly looking up a package
> > wondering how we use it in Solr just to find it’s a Tika package….  So….
> > For that reason I think we need to do something better.
> >
> > I like moving SolrCell to a package (
> > https://issues.apache.org/jira/browse/SOLR-15951).   We have this
> > powerful packaging feature, and yet we hardly dogfood it ourselves….  I’d
> > love to see us separate out SolrCell and make it easy to do `bin/solr
> > package install solrcell` and have it work!  It would both validate the
> > whole Package concept, and minimize the dependencies in Solr’s tarball.
> >
> > Secondly, for folks who really do want to run a separate Tika server, I’d
> > love to make it easier to use.    Tika has introduced a new “pipes” concept
> > to reduce the amount of back and forth when working with Tika Server that
> > might tie nicely into the Solr update pipeline.  I don’t think any real
> > work has been done on this…. Hoping Tim Allison weighs in on this topic ;-)
> >
> > Eric
> >
> >
> > > On Mar 8, 2023, at 9:50 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> > >
> > > On 3/7/2023 3:48 PM, Jan Høydahl wrote:
> > >> * Move SolrCell to a package, outside of Solr's tarball SOLR-15951 <https://issues.apache.org/jira/browse/SOLR-15951>
> > >> * Deprecate SolrCell SOLR-13973 <https://issues.apache.org/jira/browse/SOLR-13973>
> > >> * Keep in Solr but use Tika-Server <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>, SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>
> > >> * Integrate Tika client-side SOLR-1526 <https://issues.apache.org/jira/browse/SOLR-1526>
> > >
> > > As you likely know, the big problem is that Tika has a habit of crashing
> > > or misbehaving, particularly with PDFs, and if it's running inside Solr,
> > > then Solr itself is going to suffer whatever bad effects Tika causes.
> > >
> > >> My current thinking / proposal is to:
> > >> * Build a new, thin Solr module that exposes a compatible
> > >>   /update/extract handler, delegating to Tika-Server (user-hosted)
> > >> * Deprecate SolrCell in current form
> > >> * From 10.0, Solr will not ship with embedded Tika, only the new
> > >>   handler delegating to Tika-Server
> > >
> > > I was thinking something along these lines too.  A separate JVM running
> > > Tika Server that can crash without taking Solr down, and communication so
> > > ERH can send commands to it, receive extracted data, and hopefully know
> > > when the other JVM crashes.  If we design it well, then the framework could
> > > be used to integrate with other extraction mechanisms besides Tika.  I
> > > think that would be quite a bit of work.
> > >
> > > It might be a good idea to make that a separate project as was done for
> > > DIH, but I have no way of guessing whether there is enough interest in the
> > > community to keep it maintained.  If it's a separate project, then I think
> > > it would just incorporate SolrJ and Tika, rather than using a special
> > > handler.  I have never used ERH in a production setting, and barely have
> > > experience with it in non-production.
> > >
> > > Thanks,
> > > Shawn
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> > > For additional commands, e-mail: dev-h...@solr.apache.org
> > >
> >
> > _______________________
> > Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> > http://www.opensourceconnections.com <
> > http://www.opensourceconnections.com/> | My Free/Busy <
> > http://tinyurl.com/eric-cal>
> > Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> > https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> >
> > This e-mail and all contents, including attachments, is considered to be
> > Company Confidential unless explicitly stated otherwise, regardless of
> > whether attachments are marked as such.
> >
> >
> 
> -- 
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
> 
