Re: Future of SolrCell (extraction module)

Gus Heck Fri, 17 Oct 2025 18:28:11 -0700

+1. Running Tika inside Solr's JVM is definitely an anti-pattern for
anything but the most trivial systems. I'd actually advocate using an
ETL/ingestion/pipeline system entirely separate from Solr, but in the
absence of that, Solr talking to a Tika Server would still be better than
nothing. Honestly, if you're at the point of consuming real documents (instead
of test example documents), it's time to start setting up realistic
configurations too. The very brief window where SolrCell might be useful
will be chock-full of throwaway work, or will lead most folks into problems
down the road if they don't throw away the work.
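
To make the "separate pipeline" idea concrete, here is a rough sketch, not
a recommended implementation: a standalone script sends a file to a Tika
Server for text extraction, then posts the result to Solr as a plain JSON
document, so no Tika code runs in Solr's JVM. The URLs, the collection name
"docs", and the field name "content_txt" are assumptions for illustration;
adjust them to your own setup and schema.

```python
import json
import urllib.request

# Assumptions for illustration only: adjust to your environment.
TIKA_URL = "http://localhost:9998/tika"  # Tika Server on its default port
SOLR_URL = "http://localhost:8983/solr"  # a local Solr node
COLLECTION = "docs"                      # hypothetical collection name


def tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for plain-text extraction."""
    return urllib.request.Request(
        tika_url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )


def solr_request(doc: dict, solr_url: str = SOLR_URL,
                 collection: str = COLLECTION) -> urllib.request.Request:
    """Build a JSON update request that adds a single document to Solr."""
    body = json.dumps([doc]).encode("utf-8")
    # commit=true keeps the sketch short; a real pipeline would batch
    # documents and rely on autoCommit instead of committing per document.
    return urllib.request.Request(
        f"{solr_url}/{collection}/update?commit=true",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def ingest(path: str, doc_id: str) -> None:
    """Extract text outside Solr, then index the already-extracted text."""
    with open(path, "rb") as f:
        data = f.read()
    with urllib.request.urlopen(tika_request(data), timeout=60) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    # "content_txt" is a hypothetical field name; use one from your schema.
    doc = {"id": doc_id, "content_txt": text}
    with urllib.request.urlopen(solr_request(doc), timeout=30) as resp:
        resp.read()
```

Calling ingest("report.pdf", "report-1") with both servers running would
index one document. A huge or malformed PDF can then only hurt the
extraction side, not the search nodes.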


Playgrounds and tutorials for learning Solr's features can work with
pre-canned data and don't need extraction.



On Fri, Oct 10, 2025 at 4:22 PM David Smiley <[email protected]> wrote:

> The approach sounds good.  I look forward to a smaller Solr distribution
> with fewer CVE risks / burden.  I've not used the module so I don't offer
> further input.
>
> On Fri, Oct 10, 2025 at 4:09 AM Jan Høydahl <[email protected]> wrote:
>
> > Hi,
> >
> > Raising awareness of a topic that was suggested some 10 years ago (see
> > SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>), and that
> > may finally happen.
> > It's about evolving our Extraction module to use TikaServer instead of
> > local in-process Tika jars.
> >
> > In Solr 9.x we have Tika 1.x jars, which are end of life. It is also an
> > anti-pattern to process huge PDFs in Solr's JVM process.
> > So in PR #3670 <https://github.com/apache/solr/pull/3670> I added the
> > concept of Extraction Backends to the ExtractingRequestHandler, adding
> > TikaServer as a new backend.
> >
> > I'd really like to get rid of the weight of Tika jar dependencies in
> > 10.0, which is soon to enter its release phase.
> > Switching to TikaServer in Solr 10 can make that happen. The PR is fairly
> > mature, but needs more eyes before merge.
> >
> > - Please voice your support for the approach
> > - More eyes on the Pull Request
> > - Test the PR branch on your own data (same API, just add
> > extraction.backend and tikaserver.url to your RH config)
> >
> > Jan
>


-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
