+1 from me too. I like the abstraction layer your PR introduces, and I would
also make this a blocker for the Solr 10 release.

I have only one question: What is the plan for upgrading Tika to 3.x, and
what impact does it have on the current PR / approach? I believe it would
be beneficial to upgrade to the latest version as well, if the
TikaServerExtractionBackend is affected by that.

---
Christos

On Fri, Oct 10, 2025 at 11:09 AM Jan Høydahl <[email protected]> wrote:

> Hi,
>
> Raising awareness of a topic that was suggested some 10 years ago (see
> SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>), and that
> may finally happen.
> It's about evolving our Extraction module to use TikaServer instead of
> local in-process Tika jars.
>
> In Solr 9.x we have Tika 1.x jars, which are end of life. It is also an
> anti-pattern to process huge PDFs in Solr's JVM process.
> So in PR #3670 <https://github.com/apache/solr/pull/3670> I added the
> concept of Extraction Backends to the ExtractingRequestHandler, adding
> TikaServer as a new backend.
>
> I'd really like to get rid of the weight of Tika jar dependencies in 10.0,
> which will soon enter its release phase.
> Switching to TikaServer in Solr 10 can make that happen. The PR is fairly
> mature, but needs more eyes before merge.
>
> - Please voice your support for the approach
> - More eyes on the Pull Request
> - Test the PR branch on your own data (same API, just add
> extraction.backend and tikaserver.url to your RH config)
>
> Jan
