Many, many thanks to Tilman for running the regression tests! The 2 new exceptions are caused by PDFBOX-5127. I'm baffled that we haven't seen these before, but they do require some rare circumstances.
The 1 new Tika exception is a zero-byte file exception. This is my fault because I changed the files between Tilman's runs.

As for XMPBox, Tilman is right that when I tried to use it many years ago, it did not have the flexibility needed for PDFs in the wild. See: https://lucene.472066.n3.nabble.com/DISCUSS-options-for-XMP-parsing-td4262520.html

2016 me: "I found that it fails on roughly 40% of XMPs I pulled out of PDFs from govdocs1/commoncrawl"

Cheers,

Tim

On Thu, Mar 11, 2021 at 1:34 PM Tilman Hausherr <thaush...@t-online.de> wrote:
>
> On 11.03.2021 at 09:00, sahy...@fileaffairs.de wrote:
> >> The three new exceptions weren't in earlier reports.
> >>
> >> IIRC the reason Tika uses Jempbox is because Xmpbox fails when there
> >> is a non-standard schema.
> > Would it make sense to add that support? If yes, could we get samples of
> > various schemas to support that development? I could look into that if we
> > think that's worth the effort.
>
> Here's an example:
>
> https://issues.apache.org/jira/browse/PDFBOX-3440
>
> Tilman
>
> > Maruan
> >
> >> Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org