On a related but separate note, I would like us to consider integrating Docling into Solr. According to the sources where I first learned about it, it is the current state of the art in document extraction. If the integration can be done via Tika, that would be nice. Otherwise, Docling has a Java library that we could consider integrating directly into Solr.
On Fri, 10 Oct, 2025, 1:39 pm Jan Høydahl, <[email protected]> wrote:

> Hi,
>
> Raising awareness of a topic that was suggested some 10 years ago (see
> SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>), and that
> may finally happen.
> It's about evolving our Extraction module to use TikaServer instead of
> local in-process Tika jars.
>
> In Solr 9.x we have Tika 1.x jars, which are end of life. It is also an
> anti-pattern to process huge PDFs in Solr's JVM process.
> So in PR #3670 <https://github.com/apache/solr/pull/3670> I added the
> concept of Extraction Backends to the ExtractingRequestHandler, adding
> TikaServer as a new backend.
>
> I'd really like to get rid of the weight of Tika jar dependencies in 10.0,
> which will soon enter its release phase.
> Switching to TikaServer in Solr 10 can make that happen. The PR is fairly
> mature, but needs more eyes before merge.
>
> - Please voice your support for the approach
> - More eyes on the Pull Request
> - Test the PR branch on your own data (same API, just add
>   extraction.backend and tikaserver.url to your RH config)
>
> Jan
