The Tika Server implementation of Solr Cell is now merged to the main and Solr 10 branches. Please take it for a spin and report bugs before the 10.0 release. There are probably some rough edges.
Thanks for review help, especially Eric Pugh!

Jan

> On 13 Oct 2025, at 11:01, Jan Høydahl <[email protected]> wrote:
>
> Thanks for replies.
>
> Gus, I agree that serious extraction projects would likely set up custom
> pipelines and software.
> That does not make SolrCell useless, and this upgrade will give it a longer
> life without the downsides.
>
> Christos, the road forward for embedded Tika seems to be making a backend
> using TikaPipes with Tika3.
>
> Ishan, I believe you'll be able to code a Docling backend for SolrCell once
> this lands.
>
> From my end the PR is complete, and I plan to merge soon.
> Being a fairly large PR I'd appreciate even more feedback and real life
> testing. And proofreading of the docs.
>
> Jan Høydahl
>
>> On 11 Oct 2025, at 11:28, Christos Malliaridis <[email protected]> wrote:
>>
>> +1 from me too. I like the abstraction layer your PR introduces and I would
>> also make this a blocking matter for Solr 10 release.
>>
>> I have only one question: What is the plan of upgrading Tika to 3.x and
>> what impact does it have on the current PR / approach? I believe it would
>> be beneficial to upgrade it to the latest version as well somehow if the
>> TikaServerExtractionBackend is affected by that.
>>
>> ---
>> Christos
>>
>> On Fri, Oct 10, 2025 at 11:09 AM Jan Høydahl <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Raising the awareness of a topic that was suggested some 10 years ago (See
>>> SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>), and that
>>> may finally happen.
>>> It's about evolving our Extraction module to use TikaServer instead of
>>> local in-process Tika jars.
>>>
>>> In Solr 9.x we have Tika 1.x jars, which are end of life. It is also an
>>> anti-pattern to process huge PDFs in Solr's JVM process.
>>> So in PR #3670 <https://github.com/apache/solr/pull/3670> I added the
>>> concept of Extraction Backends to the ExtractingRequestHandler, adding
>>> TikaServer as a new backend.
>>>
>>> I'd really like to get rid of the weight of Tika jar dependencies in 10.0,
>>> which is soon to start release phase.
>>> Switching to TikaServer in Solr 10 can make that happen. The PR is fairly
>>> mature, but needs more eyes before merge.
>>>
>>> - Please voice your support for the approach
>>> - More eyes on the Pull Request
>>> - Test the PR branch on your own data (same API, just add
>>> extraction.backend and tikaserver.url to your RH config)
>>>
>>> Jan
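
For anyone wanting to try the testing step from the original message (adding extraction.backend and tikaserver.url to the request handler config), here is a rough sketch in Python that only builds a Config API command body and prints it. Only the two parameter names come from this thread. The backend value "tikaserver", the URL http://localhost:9998, the helper name build_extract_handler_command, and the placement of both settings as top-level handler properties are assumptions and should be checked against the merged docs.

```python
import json


def build_extract_handler_command(tika_url="http://localhost:9998",
                                  backend="tikaserver"):
    """Build a Config API command body for the /update/extract handler.

    Hypothetical helper: the parameter names extraction.backend and
    tikaserver.url come from the thread; their values and their placement
    as top-level handler properties are assumptions.
    """
    return {
        "update-requesthandler": {
            "name": "/update/extract",
            "class": "org.apache.solr.handler.extraction.ExtractingRequestHandler",
            "extraction.backend": backend,   # assumed value
            "tikaserver.url": tika_url,      # assumed Tika Server address
        }
    }


if __name__ == "__main__":
    # Print the JSON body; nothing is sent to Solr here.
    print(json.dumps(build_extract_handler_command(), indent=2))
```

The printed JSON could then be POSTed to the collection's /config endpoint, or the same two settings could be added by hand to the handler definition in solrconfig.xml. Whether update-requesthandler or add-requesthandler is the right command depends on how the handler is already defined.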
