[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

Jira Thu, 22 May 2025 03:29:46 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953398#comment-17953398
 ]


Jan Høydahl commented on SOLR-7632:
-----------------------------------

Guys, this effort has been dormant for 2 years. I had a branch where I started 
some experimentation, but I got stuck on the lack of a Java client for Tika 
Server and its somewhat poor documentation.

Yesterday I watched the Google I/O keynote and started playing with the Jules 
tool. Looking for a problem to throw at it, I figured why not pick a 
non-trivial Solr issue and see what it can make of it. So I prompted it as 
follows:
{quote}Read https://issues.apache.org/jira/browse/SOLR-7632 which proposes to 
deprecate the old "extraction" module, and replace it with an api-compatible 
new module that instead of parsing rich text documents in-process with Tika, 
will delegate to an externally running Tika-Server. More discussion can be 
found in https://lists.apache.org/thread/lbm6wb88gd1cfktgs6sfvw5xf73o8trd.

Do not focus on deprecating the old module yet. Just make a working PR for the 
new module. You can assume that the user has provisioned a TikaServer on some 
URL. When writing tests for the module, a good idea could be to look at the 
existing tests for "extraction" handler. You can choose whether you mock 
TikaServer API in tests or spin up a TikaServer using TestContainers.

The PullRequest should also add reference guide documentation for the new 
feature. 
{quote}
At first it used Apache HTTPClient and gave the new module an awkward name, so 
I prompted it to change those two things with an additional prompt:
{quote}Please don't use Apache httpClient. Use Jetty httpclient instead, or JDK 
httpclient. Please name the module "tika"
{quote}
And this PR is what it came up with after about 30 minutes: 
[https://github.com/apache/solr/pull/3361] 
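
For readers who have not used Tika Server, here is a minimal sketch of the 
kind of call a module like this would have to make, using the JDK HttpClient 
as requested in the second prompt. This is not code from the PR. The class and 
method names are hypothetical, and it assumes a Tika Server is already running 
on http://localhost:9998 and exposes its usual PUT /tika endpoint, which 
returns the extracted text.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper, for illustration only; not taken from the PR.
final class TikaServerSketch {

    // Builds a PUT request to <base>/tika that asks for plain text back.
    // Note: resolve("/tika") replaces any path already present in the base URL.
    static HttpRequest buildExtractRequest(URI tikaBaseUrl, byte[] document, String contentType) {
        return HttpRequest.newBuilder(tikaBaseUrl.resolve("/tika"))
                .timeout(Duration.ofSeconds(60))
                .header("Content-Type", contentType)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the document to Tika Server and returns the extracted text.
    static String extractText(HttpClient client, URI tikaBaseUrl, byte[] document, String contentType)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(
                buildExtractRequest(tikaBaseUrl, document, contentType),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika Server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: TikaServerSketch <file> [tika-base-url]");
            return;
        }
        // Assumed default: a locally provisioned Tika Server on port 9998.
        URI base = URI.create(args.length > 1 ? args[1] : "http://localhost:9998");
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        String text = extractText(HttpClient.newHttpClient(), base, document,
                "application/octet-stream");
        System.out.println(text);
    }
}
```

A real handler would additionally need the metadata (Tika Server has separate 
endpoints for that) and would have to map the result onto Solr fields the way 
the existing "extraction" module does. The sketch only shows the network hop 
that replaces in-process parsing.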

I have only skimmed the code and not tested it at all, but I thought the 
experiment was interesting enough to share as a (Draft) PR, which will also 
run the tests. I do not expect production-ready code, and there may be 
push-back on the legality of accepting such a large contribution from AI. But 
it gives a glimpse into how software development may change in the future.

Next I'll look more into the code and tests and make up my own mind as to 
whether this is good stuff and a possible starting point for the new module.

> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
>                 Key: SOLR-7632
>                 URL: https://issues.apache.org/jira/browse/SOLR-7632
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris A. Mattmann
>            Assignee: Jan Høydahl
>            Priority: Major
>              Labels: gsoc2017, memex, pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> It's a pain to upgrade Tika's jars every time we release, and if Tika fails 
> it messes up the ExtractingRequestHandler (e.g., when a document type causes 
> Tika to fail). A more reliable, better separated, and easier to deploy 
> version of the ExtractingRequestHandler would make a network call to the 
> Tika JAXRS server, and then call Tika on the Solr server side, get the 
> results and then index the information that way. I have a patch in the works 
> from the DARPA Memex project and I hope to post it soon.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
