On 13.08.2023 09:24, Tilman Hausherr wrote:
https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.29_vs_3.0.0.tar.xz

I had only a short look but I'm optimistic. Some differences may be because of the XMP bug.

I had another look, this time at the content differences. (The exceptions are about zero length, so I guess these are temporary problems.)

I couldn't reproduce most of the differences with PDFBox text extraction, so there was a bug either in my Tika migration or in Tika itself.

bug_trackers/poppler/poppler-106962-0.zip-0.pdf is really different. But it's possible that we already worked on that file (and gave up).


bug_trackers/PDFBOX/PDFBOX-3875-7.pdf and the other one are different, but 2.0.29 has a bug, which we may or may not want to fix.

It's on page 10 of the PDFsam_merge.pdf file in PDFBOX-3875.


govdocs1/372/372582.pdf

is a mess, so I don't care.


commoncrawl3_refetched/HY/HYPIS6AQFMRDA5RQ7HTDNEOOAH3UXABH

I couldn't reproduce this one: "actualización" does occur 47 times in the 3.0 version (and also in the migrated Tika itself).
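
For anyone who wants to repeat this kind of check, here is a minimal sketch of counting a word in already extracted text. It is not the exact procedure used for the reports. The class and method names are hypothetical, and the NFC normalization step is my own assumption, added so that a precomposed "ó" and "o" plus a combining accent are counted as the same word. The text itself would come from whatever extraction is being compared, for example the output of PDFBox's PDFTextStripper.getText().

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or Tika.
class OccurrenceCount {

    // Counts non-overlapping occurrences of word in text.
    // Both strings are normalized to NFC first (an assumption of this sketch),
    // so composed and decomposed accented characters compare equal.
    static int count(String text, String word) {
        String haystack = Normalizer.normalize(text, Normalizer.Form.NFC);
        String needle = Normalizer.normalize(word, Normalizer.Form.NFC);
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Placeholder input; in practice, pass the extracted text of the file under test.
        String extracted = args.length > 0 ? args[0] : "";
        System.out.println(count(extracted, "actualización"));
    }
}
```

Running the same count on the text produced by both versions shows quickly whether a reported difference is real or an artifact of the comparison.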


So it's a 👍 from me

Tilman


