> Re 308576.pdf: the text extraction has a huge loss, but a manual check shows
> it is identical. However that file has the NPE from PDActionURI.getURI(),
> could it be that this results in an abort of text extraction? Same for 569017.pdf.
Likely. There are two "per file pair contents" files. The one ending in "_ignore_exceptions.xlsx" omits results for any pair where an exception was caught for either file (so 308576.pdf and 569017.pdf aren't in that file). The other one, "*_with_exceptions.xlsx", includes all pairs, whether or not an exception was caught. Based on your feedback, should I add two boolean columns, exceptionInA and exceptionInB, to "*_with_exceptions.xlsx"?
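
To make the proposal concrete, here is a rough sketch of how the two flags would relate to the two reports. This is not the actual reporting code: the PairResult type, the function names, and the file names in the sample rows are all made up for illustration. Only the exceptionInA/exceptionInB idea comes from the question above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairResult:
    """One row of a hypothetical "per file pair contents" report."""
    file_name: str
    exception_in_a: bool  # proposed column: an exception was caught for the file in run A
    exception_in_b: bool  # proposed column: an exception was caught for the file in run B


def with_exceptions(rows: List[PairResult]) -> List[PairResult]:
    """Keep every pair; the two flags tell the reader which side failed."""
    return list(rows)


def ignore_exceptions(rows: List[PairResult]) -> List[PairResult]:
    """Drop any pair where an exception was caught on either side."""
    return [r for r in rows if not (r.exception_in_a or r.exception_in_b)]


# Made-up sample rows, not real results.
rows = [
    PairResult("clean.pdf", False, False),
    PairResult("failed_in_b.pdf", False, True),
    PairResult("failed_in_both.pdf", True, True),
]

for r in with_exceptions(rows):
    print(r.file_name, r.exception_in_a, r.exception_in_b)
```

With the flags in place, a large text loss on a row where one flag is true could be read as "extraction aborted on that side" rather than as a real regression.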
