Apologies for my delay...
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-3.x-snapshot-reports.tgz
I added two new reports: new_catastrophic_exceptions_in_b and
fixed_catastrophic_exceptions_in_b. The former lists files that had
a missing or 0-byte extract in B but not in A. The latter lists the
opposite. We get missing or 0-byte extracts when the app crashes
(timeout, OOM, or other fatal crash). Because the run is
multithreaded, every file that is being parsed at the moment of a
catastrophic event ends up with a 0-byte or missing extract. So the
reports likely include some files that are actually OK.
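In case it helps to see the logic behind the two reports, here is a
rough sketch in Python. It is not the actual tika-eval code; the
function names, the directory arguments and the idea of passing in a
list of extract file names are all my assumptions for illustration.

```python
from pathlib import Path


def is_catastrophic(extract: Path) -> bool:
    """True if the extract is missing or 0 bytes long."""
    return not extract.is_file() or extract.stat().st_size == 0


def compare_extracts(names, dir_a, dir_b):
    """Split extract names into (new_in_b, fixed_in_b).

    names: hypothetical list of extract file names expected in both runs.
    dir_a, dir_b: hypothetical directories holding the extracts of A and B.
    """
    new_in_b, fixed_in_b = [], []
    for name in names:
        bad_a = is_catastrophic(Path(dir_a) / name)
        bad_b = is_catastrophic(Path(dir_b) / name)
        if bad_b and not bad_a:
            # Missing or 0-byte in B only.
            new_in_b.append(name)
        elif bad_a and not bad_b:
            # Missing or 0-byte in A only.
            fixed_in_b.append(name)
    return new_in_b, fixed_in_b
```

Files that are bad (or fine) in both runs land in neither list. A file
in the "new" list is only a candidate: it may just have been in flight
on another thread when the crash happened.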
I ran the comparison before the fix for the infinite loop that Tilman
made this morning. Note that the infinite loop surfaced as a regular
IOException, because TikaInputStream detected too many EOFs, so it
did not cause catastrophic problems.
Let me know if you have questions. I haven't looked in great detail yet...
There's every chance that I need to make some more changes on the Tika side. :D
Cheers and happy 3.x!
Best,
Tim
On Wed, Apr 7, 2021 at 9:23 AM Tim Allison <[email protected]> wrote:
>
> LOL... K. I'll build locally with the PDFBOX-5153 fix and kick it
> off today or tomorrow.
>
> On Wed, Apr 7, 2021 at 1:40 AM Tilman Hausherr <[email protected]> wrote:
> >
> > Yes it would be useful and no I haven't done it. I'm optimistic about
> > the results despite PDFBOX-5153.
> >
> > Tilman
> >
> > On 06.04.2021 at 17:22, Tim Allison wrote:
> > > Hi All,
> > >
> > > Would it be useful for me to run regression tests comparing 2.x with
> > > 3.0.0-RC1 now or should I wait? Or, has someone already done this?
> > >
> > > See https://issues.apache.org/jira/browse/TIKA-3347 for integration
> > > with Tika. Many thanks!
> > >
> > > Cheers,
> > >
> > > Tim
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> >