I tried to find a Java client for Tika Server, but with no success so far. Upgrading would reduce the number of known vulnerabilities from about 21 CVEs to 6, so it would definitely be an improvement and is probably worth the migration effort until such a client is available.
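
For reference, the Tika Server REST API can also be called without a dedicated client, using the JDK's built-in java.net.http.HttpClient (Java 11+). Below is a minimal sketch, not a proposed implementation: the class and method names are hypothetical, and it assumes a Tika Server reachable at a base URL such as http://localhost:9998 (the server's default port). It relies on the PUT /tika endpoint for extracted text and PUT /meta for metadata.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Solr or Tika.
public class TikaServerSketch {
    private final URI baseUri;
    private final HttpClient http = HttpClient.newHttpClient();

    // baseUrl is an assumption, e.g. "http://localhost:9998".
    public TikaServerSketch(String baseUrl) {
        this.baseUri = URI.create(baseUrl.endsWith("/") ? baseUrl : baseUrl + "/");
    }

    // Builds a PUT /tika request; Tika Server answers with the extracted text.
    public HttpRequest textRequest(HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(baseUri.resolve("tika"))
                .header("Accept", "text/plain")
                .PUT(body)
                .build();
    }

    // Builds a PUT /meta request; Tika Server answers with document metadata as JSON.
    public HttpRequest metadataRequest(HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(baseUri.resolve("meta"))
                .header("Accept", "application/json")
                .PUT(body)
                .build();
    }

    // Sends the file to the server and returns the extracted plain text.
    public String extractText(Path file) throws IOException, InterruptedException {
        HttpResponse<String> response = http.send(
                textRequest(HttpRequest.BodyPublishers.ofFile(file)),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika Server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Something like new TikaServerSketch("http://localhost:9998").extractText(Path.of("sample.pdf")) would then return the document text. Mapping the /meta response onto Solr fields with feature parity to the extracting handler would be the larger part of the work and is not covered here.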

On Mon, 12 Aug 2024, 18:15 Jan Høydahl, <jan....@cominvent.com> wrote:

> Hi
>
> Wrt Tika, I had been hoping that we could replace extracting handler with
> a processor that delegates to Tika Server, but is otherwise feature parity.
> It would remove tons of dependencies and attack surface from Solr.
>
> I tried a POC once but could not find a suitable Java client for Tika
> Server REST API. Perhaps that exists now?
>
> Jan Høydahl
>
> > On 12 Aug 2024, at 16:20, Christos Malliaridis
> > <c.malliari...@gmail.com> wrote:
> >
> > Hello everyone,
> >
> > I've been looking into the dependencies of the project and thought that
> > we could update a couple of them, together with their license files
> > (wherever necessary).
> >
> > I tried to start with Apache Tika and upgrade it from 1.28.5 to 2.9.2,
> > which is a huge step due to some restructuring of Apache Tika. The
> > affected modules are extraction and langid.
> >
> > There is a PR from solrbot <https://github.com/apache/solr/pull/2583>
> > that requires some manual work that I have already picked up for
> > learning purposes. I'd like to create a ticket for the upgrade, but also
> > saw that there is also SOLR-13973
> > <https://issues.apache.org/jira/browse/SOLR-13973> that is titled
> > "Deprecate Tika". From the age and conversation on the ticket, it sounds
> > like Tika will not be deprecated and the ticket can be closed. But I am
> > not sure and would like to ask for your input on this.
> >
> > In the migration to 2.9.2 it seems that there are some conflicts with
> > the way the title from documents is extracted. Some metadata tags have
> > also been removed / replaced, which needs more attention. See Migrating
> > to Tika 2.0.0
> > <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0>
> > for more details.
> >
> > I'd be happy to create a PR for the upgrade and look into the fixes with
> > someone that has already worked with Apache Tika 2.X or the affected
> > modules (extraction/langid).
> >
> > Best,
> > Christos