This could be discussed at our next community meetup, or at a dedicated one if the topic would otherwise dominate.

On Tue, Aug 13, 2024 at 12:21 PM Tim Allison <talli...@apache.org> wrote:
>
> All,
>
> Let me know how I can help. If there’s any way we can move people to
> tika-pipes, that’d be best.
>
> We have a Solr emitter already in Tika, but that might add too much
> complexity for people just beginning.
>
> I’m strongly in favor of extricating Tika’s dependencies from Solr’s for
> all of the reasons mentioned.
>
> Perhaps a meetup or telecon next week?
>
> Best,
> Tim
>
>
> On Tue, Aug 13, 2024 at 11:02 AM David Smiley <dsmi...@apache.org> wrote:
>
> > Alternatively, just like we did with the DataImportHandler (DIH)[1],
> > we migrate the Tika stuff to an independent project/home on GitHub and
> > people install it if they need it. Like the DIH, Solr's Tika
> > integration is quite popular/used so I expect it'll be maintained
> > instead of abandoned. At that point, whether it's migrated to
> > TikaServer or whatever is a choice up to whoever the maintainer(s)
> > are. I suppose proceeding in this direction requires volunteers.
> >
> > [1] https://github.com/SearchScale/dataimporthandler
> >
> > On Mon, Aug 12, 2024 at 1:15 PM Christos Malliaridis
> > <c.malliari...@gmail.com> wrote:
> > >
> > > I tried to find a java client for tika, but with no success so far.
> > >
> > > The version upgrade would reduce the vulnerabilities from about 21 CVEs
> > > to 6, so it would definitely be an improvement and probably worth the
> > > migration effort until a client is available.
> > >
> > > On Mon, 12 Aug 2024, 18:15 Jan Høydahl, <jan....@cominvent.com> wrote:
> > >
> > > > Hi
> > > >
> > > > Wrt Tika, I had been hoping that we could replace extracting handler
> > > > with a processor that delegates to Tika Server, but is otherwise
> > > > feature parity. It would remove tons of dependencies and attack
> > > > surface from Solr.
> > > >
> > > > I tried a POC once but could not find a suitable Java client for Tika
> > > > Server REST API. Perhaps that exists now?
> > > >
> > > > Jan Høydahl
> > > >
> > > > > On 12 Aug 2024, at 16:20, Christos Malliaridis
> > > > > <c.malliari...@gmail.com> wrote:
> > > > >
> > > > > Hello everyone,
> > > > >
> > > > > I've been looking into the dependencies of the project and thought
> > > > > that we could update a couple of them, together with their license
> > > > > files (wherever necessary).
> > > > >
> > > > > I tried to start with Apache Tika and upgrade it from 1.28.5 to
> > > > > 2.9.2, which is a huge step due to some restructuring of Apache
> > > > > Tika. The affected modules are extraction and langid.
> > > > >
> > > > > There is a PR from solrbot
> > > > > <https://github.com/apache/solr/pull/2583> that requires some
> > > > > manual work that I have already picked up for learning purposes.
> > > > > I'd like to create a ticket for the upgrade, but also saw that
> > > > > there is also SOLR-13973
> > > > > <https://issues.apache.org/jira/browse/SOLR-13973> that is titled
> > > > > "Deprecate Tika". From the age and conversation on the ticket, it
> > > > > sounds like Tika will not be deprecated and the ticket can be
> > > > > closed. But I am not sure and would like to ask for your input on
> > > > > this.
> > > > >
> > > > > In the migration to 2.9.2 it seems that there are some conflicts
> > > > > with the way the title from documents is extracted. Some metadata
> > > > > tags have also been removed / replaced, which needs more attention.
> > > > > See Migrating to Tika 2.0.0
> > > > > <https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0>
> > > > > for more details.
> > > > >
> > > > > I'd be happy to create a PR for the upgrade and look into the fixes
> > > > > with someone that has already worked with Apache Tika 2.X or the
> > > > > affected modules (extraction/langid).
> > > > >
> > > > > Best,
> > > > > Christos
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> > > > For additional commands, e-mail: dev-h...@solr.apache.org