On 28.07.2020 at 23:51, Tim Allison wrote:
Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
Thank you. Besides the exceptions, there are a few cases in content extraction where "TOP_10_MORE_IN_B" is empty while "TOP_10_MORE_IN_A" has meaningful content. That is suspicious and needs further investigation.
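
A rough sketch of how such rows could be picked out, assuming the content comparison report has been exported to CSV. Only the two column names "TOP_10_MORE_IN_A" and "TOP_10_MORE_IN_B" come from the report; the "FILE_PATH" column, the sample rows, and the helper names are hypothetical and only there for illustration.

```python
import csv
import io


def suspicious_rows(rows):
    """Return rows where TOP_10_MORE_IN_B is empty but TOP_10_MORE_IN_A is not."""
    flagged = []
    for row in rows:
        # Treat missing values and whitespace-only cells as empty.
        more_in_a = (row.get("TOP_10_MORE_IN_A") or "").strip()
        more_in_b = (row.get("TOP_10_MORE_IN_B") or "").strip()
        if more_in_a and not more_in_b:
            flagged.append(row)
    return flagged


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample data; FILE_PATH is an assumed column name.
SAMPLE_CSV = """FILE_PATH,TOP_10_MORE_IN_A,TOP_10_MORE_IN_B
a.pdf,the: 12 | of: 9,
b.pdf,,
c.pdf,foo: 3,bar: 2
"""

if __name__ == "__main__":
    for row in suspicious_rows(load_rows(SAMPLE_CSV)):
        print(row["FILE_PATH"])
```

Rows where both columns are empty are skipped on purpose, since those show no difference between the two runs.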
Tilman