Future of SolrCell (extraction module)

Jan Høydahl Fri, 10 Oct 2025 01:09:47 -0700

Hi,

I'd like to raise awareness of a topic that was suggested some 10 years ago 
(see SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>), and that 
may finally happen.
It's about evolving our Extraction module to use TikaServer instead of local 
in-process Tika jars.


In Solr 9.x we ship Tika 1.x jars, and Tika 1.x is end of life. It is also an 
anti-pattern to process huge PDFs in Solr's JVM process.
So in PR #3670 <https://github.com/apache/solr/pull/3670> I added the concept 
of Extraction Backends to the ExtractingRequestHandler, adding TikaServer as a 
new backend.

I'd really like to get rid of the weight of Tika jar dependencies in 10.0, 
which will soon enter its release phase.
Switching to TikaServer in Solr 10 can make that happen. The PR is fairly 
mature, but needs more eyes before merge.

- Please voice your support for the approach
- Please review the Pull Request, it needs more eyes
- Test the PR branch on your own data (same API, just add extraction.backend 
and tikaserver.url to your request handler config)
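
For anyone testing, here is a minimal Python sketch of sending a document to 
the extraction handler. Since the API is the same, the client side should not 
need to change whichever backend is configured. The host, port, collection 
name, file name and the helper build_extract_request are hypothetical 
placeholders, not something taken from the PR. extraction.backend and 
tikaserver.url belong in the request handler config and do not appear here.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(solr_url, collection, doc_id, data,
                          content_type="application/pdf", commit=True):
    """Build a POST request for the /update/extract handler.

    build_extract_request is a hypothetical helper for illustration only.
    """
    # literal.id sets the document id, commit=true makes it searchable at once.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    url = f"{solr_url.rstrip('/')}/{collection}/update/extract?{urlencode(params)}"
    # The raw file bytes are the request body; Content-Type tells Solr what it is.
    return Request(url, data=data, headers={"Content-Type": content_type},
                   method="POST")


if __name__ == "__main__":
    # Assumed local setup: Solr on localhost:8983, a collection "mycollection",
    # and a file sample.pdf in the working directory.
    with open("sample.pdf", "rb") as fh:
        req = build_extract_request("http://localhost:8983/solr",
                                    "mycollection", "doc1", fh.read())
    with urlopen(req) as resp:
        print(resp.status)
```

Running the same script against a 9.x node and against the PR branch, then 
comparing the indexed fields, is probably the quickest way to spot differences 
between the two backends.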

Jan
