Re: [DISCUSS] Future of SolrCell in Solr

Jan Høydahl Thu, 23 Mar 2023 11:43:24 -0700

Documentation-wise, we can rewrite the chapter we have on rich text indexing to 
mention several options, including tika-server and tika-pipes with the Solr emitter.


As for a SolrCell successor, I still think a super-thin module forwarding to 
Tika Server is the best option. Users would get the same features and API as 
today, so those who rely on SolrCell have a simple migration path. It may also 
be a benefit that they get better control over their Tika Server: its version, 
scaling, which parsers are included, etc. I want to do a quick POC on this to 
see how it flies.
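
To make the forwarding idea concrete, here is a rough client-side sketch in 
Python of the same flow: send the bytes to Tika Server, then index the result 
in Solr. This is not the POC itself, only an illustration under assumptions. 
The host names, the "docs" collection, and the "content" / "content_type" 
field names are hypothetical; the /rmeta/text endpoint on Tika Server and the 
JSON /update endpoint on Solr are the ones I have in mind.

```python
import json
import urllib.request

# Assumed addresses: 9998 is Tika Server's default port; the "docs"
# collection name is hypothetical.
TIKA_URL = "http://localhost:9998"
SOLR_URL = "http://localhost:8983/solr/docs"


def extract_with_tika(data: bytes, tika_url: str = TIKA_URL) -> list:
    """PUT raw bytes to Tika Server's /rmeta/text endpoint.

    Returns the parsed JSON response: a list of metadata dicts, the first
    one describing the container document.
    """
    req = urllib.request.Request(
        f"{tika_url}/rmeta/text",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def to_solr_doc(rmeta: list, literals: dict) -> dict:
    """Build one Solr document from Tika output plus literal.*-style fields.

    "content" and "content_type" are hypothetical schema field names.
    """
    container = rmeta[0] if rmeta else {}
    doc = dict(literals)
    doc["content"] = (container.get("X-TIKA:content") or "").strip()
    if "Content-Type" in container:
        doc["content_type"] = container["Content-Type"]
    return doc


def index_doc(doc: dict, solr_url: str = SOLR_URL) -> int:
    """POST a single document to Solr's JSON update endpoint and commit."""
    req = urllib.request.Request(
        f"{solr_url}/update?commit=true",
        data=json.dumps([doc]).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status
```

With both servers running, a file would be indexed with something like 
index_doc(to_solr_doc(extract_with_tika(data), {"id": "doc1"})). The thin 
module would do the equivalent server-side behind /update/extract, so a Tika 
crash only affects the Tika Server process.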

Jan

> On 23 Mar 2023, at 17:14, Tim Allison <[email protected]> wrote:
> 
> Apologies for being late to the show, and thank you Eric for pinging me on 
> this.
> 
> I'm 100% for factoring out Tika from the same jvm as Solr.  I see three 
> options for removing Tika from Solr's jvm, making it easier for users and 
> keeping Tika's jar hell all to itself.
> 
> 1) As already proposed, use Tika server and somehow figure out how to 
> integrate that seamlessly.
> 
> 2) Use Tika pipes within Solr directly or within a package (as Eric suggests). 
>  This forks a process for parsing, and all the heavy dependencies go into the 
> forked process.  Solr would need tika-core, but could specify a directory 
> with tika-app.jar in it.  The dependency nightmare in tika-app.jar would not 
> get loaded into Solr's jvm.  We'd probably have to make some mods to 
> tika-pipes for this to work roughly as Tika is being used now, but I think 
> something like this is doable...
> 
> 3) Direct users to tika-pipes directly.  We have a Solr emitter.  Users can 
> aim tika-pipes at a directory of files, an S3 bucket, a gcs thing, etc, and 
> Tika will safely parse the files in a forked process and forward the results 
> to Solr.  This is not as easy as curling bytes to Solr and having those bytes 
> parsed, but it is possible.
> 
> Please let me know how I can help.
> 
> Best,
> 
>    Tim
> 
> On 2023/03/10 03:57:45 Gus Heck wrote:
>> I totally think that for any heavy-duty use case, or any use case
>> where the documents are not constrained to a known set with polite
>> characteristics (i.e. known not to be password protected, reasonable
>> length, etc), Tika should not run inside solr. That said, as I see it the
>> key downside of not having solr-cell as part of solr would be that we would
>> likely remove the docs for it too, and the entire concept of how to get a
>> "normal" document into solr evaporates from our ref guide. So I like the
>> sound of it being an official package as Eric suggests, and perhaps even
>> the canonical example of how to install a package... Along with heavy
>> documentation caveats of why Tika should run outside of solr for most
>> production purposes of course.
>> 
>> -Gus
>> 
>> 
>> On Thu, Mar 9, 2023 at 8:09 AM Eric Pugh <[email protected]>
>> wrote:
>> 
>>> I did a series of blog posts about Tika, and while conventional wisdom is
>>> that running Tika in Solr is bad, I’ve had GREAT luck with it over the
>>> years.
>>> https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/
>>> 
>>> Having said that, my bigger beef with Tika in Solr is about all the
>>> dependencies that it drags along.   I am constantly looking up a package
>>> wondering how we use it in Solr just to find it’s a Tika package….  So….
>>> For that reason I think we need to do something better.
>>> 
>>> I like moving SolrCell to a package
>>> (https://issues.apache.org/jira/browse/SOLR-15951). We have this
>>> powerful packaging feature, and yet we hardly dogfood it ourselves…. I’d
>>> love to see us separate out SolrCell and make it easy to do `bin/solr
>>> package install solrcell` and have it work! It would both validate the
>>> whole Package concept and minimize the dependencies in Solr’s tarball.
>>> 
>>> Secondly, for folks who really do want to run a separate Tika server, I’d
>>> love to make it easier to use.    Tika has introduced a new “pipes” concept
>>> to reduce the amount of back and forth when working with Tika Server that
>>> might tie nicely into the Solr update pipeline.  I don’t think any real
>>> work has been done on this…. Hoping Tim Allison weighs in on this topic ;-)
>>> 
>>> Eric
>>> 
>>> 
>>>> On Mar 8, 2023, at 9:50 PM, Shawn Heisey <[email protected]> wrote:
>>>> 
>>>> On 3/7/2023 3:48 PM, Jan Høydahl wrote:
>>>>> * Move SolrCell to a package, outside of Solr's tarball: SOLR-15951
>>>>>   <https://issues.apache.org/jira/browse/SOLR-15951>
>>>>> * Deprecate SolrCell: SOLR-13973
>>>>>   <https://issues.apache.org/jira/browse/SOLR-13973>
>>>>> * Keep in Solr but use Tika-Server
>>>>>   <https://cwiki.apache.org/confluence/display/TIKA/TikaServer>: SOLR-7632
>>>>>   <https://issues.apache.org/jira/browse/SOLR-7632>
>>>>> * Integrate Tika client-side: SOLR-1526
>>>>>   <https://issues.apache.org/jira/browse/SOLR-1526>
>>>> 
>>>> As you likely know, the big problem is that Tika has a habit of crashing
>>>> or misbehaving, particularly with PDFs, and if it's running inside Solr,
>>>> then Solr itself is going to suffer whatever bad effects Tika causes.
>>>> 
>>>>> My current thinking / proposal is to:
>>>>> * Build a new, thin Solr module that exposes a compatible
>>>>>   /update/extract handler, delegating to Tika-Server (user-hosted)
>>>>> * Deprecate SolrCell in current form
>>>>> * From 10.0, Solr will not ship with embedded Tika, only the new
>>>>>   handler delegating to Tika-Server
>>>> 
>>>> I was thinking something along these lines too.  A separate JVM running
>>>> Tika Server that can crash without taking Solr down, and communication so
>>>> ERH can send commands to it, receive extracted data, and hopefully know
>>>> when the other JVM crashes.  If we design it well, then the framework could
>>>> be used to integrate with other extraction mechanisms besides Tika.  I
>>>> think that would be quite a bit of work.
>>>> 
>>>> It might be a good idea to make that a separate project as was done for
>>>> DIH, but I have no way of guessing whether there is enough interest in the
>>>> community to keep it maintained.  If it's a separate project, then I think
>>>> it would just incorporate SolrJ and Tika, rather than using a special
>>>> handler.  I have never used ERH in a production setting, and barely have
>>>> experience with it in non-production.
>>>> 
>>>> Thanks,
>>>> Shawn
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>> 
>>> 
>>> _______________________
>>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
>>> http://www.opensourceconnections.com <
>>> http://www.opensourceconnections.com/> | My Free/Busy <
>>> http://tinyurl.com/eric-cal>
>>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
>>> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>>> 
>>> This e-mail and all contents, including attachments, is considered to be
>>> Company Confidential unless explicitly stated otherwise, regardless of
>>> whether attachments are marked as such.
>>> 
>>> 
>> 
>> -- 
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>> 
> 

