All,

Let me know how I can help. If there’s any way we can move people to
tika-pipes, that’d be best.

We have a Solr emitter already in Tika, but that might add too much
complexity for people just beginning.

I’m strongly in favor of extricating Tika’s dependencies from Solr’s for
all of the reasons mentioned.

Perhaps a meetup or telecon next week?

Best,
    Tim


On Tue, Aug 13, 2024 at 11:02 AM David Smiley <dsmi...@apache.org> wrote:

> Alternatively, just like we did with the DataImportHandler (DIH)[1],
> we migrate the Tika stuff to an independent project/home on GitHub and
> people install it if they need it.  Like the DIH, Solr's Tika
> integration is quite popular/used so I expect it'll be maintained
> instead of abandoned.  At that point, whether it's migrated to
> TikaServer or whatever is a choice up to whoever the maintainer(s)
> are.  I suppose proceeding in this direction requires volunteers.
>
> [1] https://github.com/SearchScale/dataimporthandler
>
> On Mon, Aug 12, 2024 at 1:15 PM Christos Malliaridis
> <c.malliari...@gmail.com> wrote:
> >
> > I tried to find a Java client for Tika, but with no success so far.
> >
> > The version upgrade would reduce the vulnerabilities from about 21 CVEs
> > to 6, so it would definitely be an improvement and probably worth the
> > migration effort until a client is available.
> >
> > On Mon, 12 Aug 2024, 18:15 Jan Høydahl, <jan....@cominvent.com> wrote:
> >
> > > Hi
> > >
> > > Wrt Tika, I had been hoping that we could replace the extracting handler
> > > with a processor that delegates to Tika Server but otherwise has feature
> > > parity. It would remove tons of dependencies and attack surface from Solr.
> > >
> > > I tried a POC once but could not find a suitable Java client for the
> > > Tika Server REST API. Perhaps that exists now?
> > >
> > > Jan Høydahl
> > >
> > > > On 12 Aug 2024 at 16:20, Christos Malliaridis
> > > > <c.malliari...@gmail.com> wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > I've been looking into the dependencies of the project and thought that
> > > > we could update a couple of them, together with their license files
> > > > (wherever necessary).
> > > >
> > > > I tried to start with Apache Tika and upgrade it from 1.28.5 to 2.9.2,
> > > > which is a huge step due to some restructuring of Apache Tika. The
> > > > affected modules are extraction and langid.
> > > >
> > > > There is a PR from solrbot <https://github.com/apache/solr/pull/2583>
> > > > that requires some manual work that I have already picked up for learning
> > > > purposes. I'd like to create a ticket for the upgrade, but I saw that
> > > > there is also SOLR-13973
> > > > <https://issues.apache.org/jira/browse/SOLR-13973>, which is titled
> > > > "Deprecate Tika". From the age and conversation on the ticket, it
> > > > sounds like Tika will not be deprecated and the ticket can be closed.
> > > > But I am not sure and would like to ask for your input on this.
> > > >
> > > > In the migration to 2.9.2 it seems that there are some conflicts with the
> > > > way the title from documents is extracted. Some metadata tags have also
> > > > been removed / replaced, which needs more attention. See Migrating to
> > > > Tika 2.0.0
> > > > <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0>
> > > > for more details.
> > > >
> > > > I'd be happy to create a PR for the upgrade and look into the fixes with
> > > > someone that has already worked with Apache Tika 2.X or the affected
> > > > modules (extraction/langid).
> > > >
> > > > Best,
> > > > Christos
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> > > For additional commands, e-mail: dev-h...@solr.apache.org
> > >
> > >
>
>
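
For reference, the idea raised in the thread (a processor that delegates
extraction to Tika Server over its REST API instead of bundling Tika in Solr)
does not strictly need a dedicated client library. Below is a rough sketch
using only the JDK's built-in HTTP client. The class and method names
(TikaServerTextClient, buildRequest, extractText) are hypothetical and exist
in neither Solr nor Tika. The sketch also assumes that Tika Server accepts a
document via PUT on /tika and returns plain text when asked with
"Accept: text/plain", and that it listens on localhost:9998; check these
against the Tika Server documentation for the version in use.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper (not part of Solr or Tika): sends a document's raw
 * bytes to a running Tika Server and returns the extracted plain text.
 */
class TikaServerTextClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI tikaEndpoint;

    public TikaServerTextClient(URI baseUri) {
        // Assumption: text extraction is exposed at the /tika resource.
        this.tikaEndpoint = baseUri.resolve("/tika");
    }

    /** Builds the PUT request; kept separate so it can be inspected without a server. */
    public HttpRequest buildRequest(byte[] document) {
        return HttpRequest.newBuilder(tikaEndpoint)
                .header("Accept", "text/plain") // ask for plain text rather than XHTML
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the document and returns the response body, failing on any non-200 status. */
    public String extractText(byte[] document) throws IOException, InterruptedException {
        HttpResponse<String> response = http.send(
                buildRequest(document),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Tika Server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A caller would construct the client with the server's base URI, for example
new TikaServerTextClient(URI.create("http://localhost:9998")), and pass the
uploaded file's bytes to extractText. Feature parity with the current
extracting handler (metadata mapping, field handling) is outside the scope of
this sketch.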
