Thanks for the replies.

Gus, I agree that serious extraction projects would likely set up custom 
pipelines and software.
That does not make SolrCell useless, and this upgrade will give it a longer 
life without the downsides.

Christos, the road forward for embedded Tika seems to be making a backend using 
TikaPipes with Tika3.

Ishan, I believe you'll be able to code a Docling backend for SolrCell once 
this lands.

From my end the PR is complete, and I plan to merge soon.
Since it is a fairly large PR, I'd appreciate even more feedback and real-life 
testing, as well as proofreading of the docs.

Jan Høydahl

> On 11 Oct 2025, at 11:28, Christos Malliaridis <[email protected]> wrote:
> 
> +1 from me too. I like the abstraction layer your PR introduces and I would
> also make this a blocking matter for the Solr 10 release.
> 
> I have only one question: What is the plan for upgrading Tika to 3.x and
> what impact does it have on the current PR / approach? I believe it would
> be beneficial to upgrade it to the latest version as well somehow if the
> TikaServerExtractionBackend is affected by that.
> 
> ---
> Christos
> 
> On Fri, Oct 10, 2025 at 11:09 AM Jan Høydahl <[email protected]> wrote:
> 
>> Hi,
>> 
>> Raising awareness of a topic that was suggested some 10 years ago (see
>> SOLR-7632 <https://issues.apache.org/jira/browse/SOLR-7632>), and that
>> may finally happen.
>> It's about evolving our Extraction module to use TikaServer instead of
>> local in-process Tika jars.
>> 
>> In Solr 9.x we have Tika 1.x jars, which are end of life. It is also an
>> anti-pattern to process huge PDFs in Solr's JVM process.
>> So in PR #3670 <https://github.com/apache/solr/pull/3670> I added the
>> concept of Extraction Backends to the ExtractingRequestHandler, adding
>> TikaServer as a new backend.
>> 
>> I'd really like to get rid of the weight of Tika jar dependencies in 10.0,
>> which will soon enter its release phase.
>> Switching to TikaServer in Solr 10 can make that happen. The PR is fairly
>> mature, but needs more eyes before merge.
>> 
>> - Please voice your support for the approach
>> - More eyes on the Pull Request
>> - Test the PR branch on your own data (same API, just add
>> extraction.backend and tikaserver.url to your RH config)
>> 
>> Jan
