Hi Tilman,
Thank you for raising this. 3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 is not
related to TIKA-3734. The updated junrar (7.5.0) is swallowing a
(new) exception on this file and stopping the parse without throwing
an exception. The earlier version of junrar (7.4.1) did not find a
problem with the file.
My Ubuntu package utility also reports an error on this file, so I
think the file itself is just kind of wonky.
I'm going to fix the dependency convergence issues. Is there anything else?
Best,
Tim
On Tue, Apr 26, 2022 at 2:52 PM Tilman Hausherr <[email protected]> wrote:
>
> On 26.04.2022 at 13:07, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/reports-tika-1.28.2-SNAPSHOT.tgz
> >
> > I found two issues that should be fixed (TIKA-3733 and TIKA-3734). I
> > think both are related to the underlying parsers being stricter (which
> > is good), but we need to change our code to handle these cases more
> > robustly.
> >
> > Let me know if you see anything else.
>
> What about commoncrawl3/3X/3X4JRZZ4TQ2GK4QQDQEXMFCVLM3FM5I4 , this is
> also a rar file and the last entry in content_diffs_no_exceptions.xlsx .
> Is that related to TIKA-3734 ?
>
> Tilman
>