Reports are here: https://corpora.tika.apache.org/base/reports/tika-2.3-vs-2.4-pdfs.tgz
It looks like there are no significant changes. There are some diffs on a few files, but this was run on ~800k PDFs.

There are a couple of cases where a file is now being detected as rfc822 instead of PDF. We have to fix that on the Tika side.

On Mon, Mar 21, 2022 at 12:53 PM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
>
> On 21.03.22 at 12:21, Tim Allison wrote:
> > I'm happy to run the tests today if that would be of any interest.
> Yes, please.
>
> TIA
> Andreas
>
> >
> > On Sun, Mar 20, 2022 at 5:01 PM Andreas Lehmkuehler <andr...@lehmi.de>
> > wrote:
> >>
> >> On 13.03.22 at 14:20, Tim Allison wrote:
> >>> From Tika's perspective, there's no rush. We're waiting for a bug fix
> >>> in POI (TIKA-3699).
> >>>
> >>> Please let me know if/when I should run the regression tests.
> >> Thanks for the offer. Do we need to run the tests before cutting the
> >> release?
> >>
> >> Most of the tickets aren't related to text extraction. Those which are
> >> related should decrease the number of exceptions and increase the accuracy.
> >>
> >> WDYT?
> >>
> >>>
> >>> Thank you, all!
> >>>
> >>> Cheers,
> >>>
> >>> Tim
> >>>
> >>> On Sat, Mar 12, 2022 at 5:29 AM Andreas Lehmkuehler <andr...@lehmi.de>
> >>> wrote:
> >>>>
> >>>> On 11.03.22 at 08:30, Tilman Hausherr wrote:
> >>>>> On 11.03.2022 at 08:19, Andreas Lehmkuehler wrote:
> >>>>>> On 10.03.22 at 20:16, Tilman Hausherr wrote:
> >>>>>>> I'd agree but that might mean PDFBOX-5384 wouldn't be fixed.
> >>>>>> It's there for quite some time and it seems to be a seldom corner
> >>>>>> case. IMHO it can wait if we won't find a solution before Monday.
> >>>>>
> >>>>> No, that one was created on March 2nd. Oliver has just posted a
> >>>>> suggestion so maybe that is a solution.
> >>>> The ticket is quite new, but the issue itself was introduced in 2018 with
> >>>> 2.0.12. ;-)
> >>>>
> >>>> However, I'll have a look at the proposed solution.
> >>>>
> >>>> Andreas
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>>>
> >>>>>> WDYT?
> >>>>>>
> >>>>>> Andreas
> >>>>>>
> >>>>>>>
> >>>>>>> Tilman
> >>>>>>>
> >>>>>>> On 10.03.2022 at 19:05, Andreas Lehmkuehler wrote:
> >>>>>>>> On 09.03.22 at 17:07, Tim Allison wrote:
> >>>>>>>>> All,
> >>>>>>>>>
> >>>>>>>>> I've been out of the office for a bit and haven't caught up yet.
> >>>>>>>>> Apologies if I've missed the discussion.
> >>>>>>>>>
> >>>>>>>>> Are there plans for a 2.0.26 release? We're probably a few weeks out
> >>>>>>>> How about cutting the release next Monday?
> >>>>>>>>
> >>>>>>>> Andreas
> >>>>>>>>
> >>>>>>>>> from starting our next 1.x and 2.x releases on Tika, and it would be
> >>>>>>>>> great to incorporate 2.0.26. No problem at all if 2.0.26 is slated
> >>>>>>>>> for later.
> >>>>>>>>>
> >>>>>>>>> Thank you!
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>>
> >>>>>>>>> Tim
> >>>>>>>>>
> >>>>>>>>> On Fri, Mar 4, 2022 at 10:46 PM Tilman Hausherr
> >>>>>>>>> <thaush...@t-online.de> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 24.02.2022 at 07:41, Andreas Lehmkuehler wrote:
> >>>>>>>>>>> On 22.02.22 at 07:49, Andreas Lehmkuehler wrote:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm planning to cut a new JBIG2 release next week. There aren't that
> >>>>>>>>>>>> much changes but I think the fixes are worth to be released. [1]
> >>>>>>>>>>> I'm going to cut the release next weekend, if nobody objects.
> >>>>>>>>>>>
> >>>>>>>>>>> Once it is done we should think about a 2.0.26 release of PDFBox
> >>>>>>>>>>
> >>>>>>>>>> Yes please!
> >>>>>>>>>>
> >>>>>>>>>> Tilman
> >>>>>>>>>>
> >>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> >>>>>>>>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org