Re: Future of SolrCell (extraction module)

Jan Høydahl Thu, 16 Oct 2025 07:23:54 -0700

The Tika Server implementation of Solr Cell is now merged to the main and 
Solr 10 branches. Please take it for a spin and report bugs before the 10.0 
release. There are probably some rough edges.


Thanks for the review help, especially Eric Pugh!

Jan

> On 13 Oct 2025, at 11:01, Jan Høydahl <[email protected]> wrote:
> 
> Thanks for replies.
> 
> Gus, I agree that serious extraction projects would likely set up custom 
> pipelines and software.
> That does not make SolrCell useless, and this upgrade will give it a longer 
> life without the downsides.
> 
> Christos, the road forward for embedded Tika seems to be building a backend 
> that uses TikaPipes with Tika 3.
> 
> Ishan, I believe you'll be able to code a Docling backend for SolrCell once 
> this lands.
> 
> From my end the PR is complete, and I plan to merge soon.
> Since it is a fairly large PR, I'd appreciate even more feedback and 
> real-life testing, as well as proofreading of the docs.
> 
> Jan Høydahl
> 
>> On 11 Oct 2025, at 11:28, Christos Malliaridis <[email protected]> wrote:
>> 
>> +1 from me too. I like the abstraction layer your PR introduces, and I would
>> also make this a blocking matter for the Solr 10 release.
>> 
>> I have only one question: What is the plan for upgrading Tika to 3.x, and
>> what impact does it have on the current PR / approach? If the
>> TikaServerExtractionBackend is affected by that, I believe it would be
>> beneficial to upgrade to the latest version as well.
>> 
>> ---
>> Christos
>> 
>> On Fri, Oct 10, 2025 at 11:09 AM Jan Høydahl <[email protected]> wrote:
>> 
>>> Hi,
>>> 
>>> Raising awareness of a topic that was suggested some 10 years ago (see
>>> SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>), and that
>>> may finally happen.
>>> It's about evolving our Extraction module to use TikaServer instead of
>>> local in-process Tika jars.
>>> 
>>> In Solr 9.x we have Tika 1.x jars, which are end of life. It is also an
>>> anti-pattern to process huge PDFs in Solr's JVM process.
>>> So in PR #3670 <https://github.com/apache/solr/pull/3670> I added the
>>> concept of Extraction Backends to the ExtractingRequestHandler, adding
>>> TikaServer as a new backend.
>>> 
>>> I'd really like to get rid of the weight of Tika jar dependencies in 10.0,
>>> which will soon enter the release phase.
>>> Switching to TikaServer in Solr 10 can make that happen. The PR is fairly
>>> mature, but needs more eyes before merge.
>>> 
>>> - Please voice your support for the approach
>>> - More eyes on the Pull Request
>>> - Test the PR branch on your own data (same API, just add
>>> extraction.backend and tikaserver.url to your RH config)
>>> 
>>> Jan
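
For anyone taking up the request to test on their own data, a smoke test could look roughly like the sketch below. It is only an illustration under stated assumptions: Tika Server is assumed to listen on its default port 9998, the collection name "mycollection" is hypothetical, and the request handler is assumed to be registered at /update/extract and already configured with extraction.backend and tikaserver.url as described above. The helper names (extract_url, tika_version, extract_only) are made up for this example and are not part of Solr or the PR.

```python
import urllib.parse
import urllib.request


def extract_url(solr_base, collection, params=None):
    """Build the URL for an extract-only request to /update/extract.

    extractOnly=true asks Solr Cell to return the extracted content
    instead of indexing it, which is handy for a first test.
    """
    query = {"extractOnly": "true", "wt": "json"}
    query.update(params or {})
    base = solr_base.rstrip("/")
    return f"{base}/{collection}/update/extract?{urllib.parse.urlencode(query)}"


def tika_version(tika_base, timeout=10):
    """Ask Tika Server for its version string as a reachability check."""
    url = f"{tika_base.rstrip('/')}/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def extract_only(solr_base, collection, path,
                 content_type="application/pdf", timeout=60):
    """POST a local file to Solr and return the raw JSON response text."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = urllib.request.Request(
        extract_url(solr_base, collection),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example usage (needs a running Solr and Tika Server; all values assumed):
#   print(tika_version("http://localhost:9998"))
#   print(extract_only("http://localhost:8983/solr", "mycollection", "sample.pdf"))
```

Running the same file against a handler configured with the old in-process extraction and one configured with the TikaServer backend, and comparing the two responses, is one way to spot the rough edges worth reporting.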


