Many, many thanks to Tilman for running the regression tests!

The 2 new exceptions are caused by PDFBOX-5127.  I'm baffled that we
haven't seen these before, but they do require some rare
circumstances.

The 1 new Tika exception is a zero-byte file exception.  This is my
fault because I changed the files between Tilman's runs.

As for XMPBox, Tilman is right that when I tried to use it many years
ago, it did not have the flexibility needed for PDFs in the wild.
See: 
https://lucene.472066.n3.nabble.com/DISCUSS-options-for-XMP-parsing-td4262520.html

2016 me: "I found that it fails on roughly 40% of XMPs I pulled out of
PDFs from govdocs1/commoncrawl"

Cheers,

             Tim

On Thu, Mar 11, 2021 at 1:34 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>
> On 11.03.2021 at 09:00, sahy...@fileaffairs.de wrote:
> >> The three new exceptions weren't in earlier reports.
> >>
> >> IIRC the reason Tika uses Jempbox is because Xmpbox fails when there
> >> is a non-standard schema.
> > would it make sense to add that support? If yes could we get samples of
> > various schemas to support that development? Could look into that if we
> > think that's worth the effort
>
>
> Here's an example:
>
> https://issues.apache.org/jira/browse/PDFBOX-3440
>
>
> Tilman
>
>
>
> >
> > Maruan
> >
> >
> >> Tilman
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> >> For additional commands, e-mail: dev-h...@pdfbox.apache.org
> >>
>
>
>

