On 13.08.2023 09:24, Tilman Hausherr wrote:
> https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.29_vs_3.0.0.tar.xz
> I had only a short look but I'm optimistic. Some differences may be
> because of the XMP bug.
I had another look, now at the content differences (the exceptions are
about zero length, so I guess these are temporary problems).
I couldn't reproduce most of the differences with PDFBox text extraction,
so either there was a bug in my Tika migration or in Tika itself.
bug_trackers/poppler/poppler-106962-0.zip-0.pdf is really different.
But it's possible that we already worked on that file (and gave up).
bug_trackers/PDFBOX/PDFBOX-3875-7.pdf and the other one are different,
but 2.0.29 has a bug there, which we may or may not want to fix.
It's on page 10 of the PDFsam_merge.pdf file in PDFBOX-3875.
govdocs1/372/372582.pdf
is a mess, so I don't care.
commoncrawl3_refetched/HY/HYPIS6AQFMRDA5RQ7HTDNEOOAH3UXABH
I couldn't reproduce this one: "actualización" does occur 47 times in the
3.0 version (and also in the migrated Tika itself).
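
For anyone who wants to repeat that kind of check, here is a minimal sketch of the counting step only. It assumes the page text has already been extracted into a String (for example with PDFTextStripper). The class name OccurrenceCount and its count method are made up for illustration, and the NFC normalization is my own assumption, not something that was done for the reports.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox or Tika.
final class OccurrenceCount {

    // Counts non-overlapping, case-insensitive occurrences of word in text.
    // Both strings are normalized to NFC first, so a precomposed "ó" and
    // "o" followed by a combining acute accent are treated as the same.
    static int count(String text, String word) {
        String haystack = Normalizer.normalize(text, Normalizer.Form.NFC)
                .toLowerCase(Locale.ROOT);
        String needle = Normalizer.normalize(word, Normalizer.Form.NFC)
                .toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return 0;
        }
        int hits = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            hits++;
            from += needle.length(); // continue after this match
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample text with one precomposed and one decomposed "ó".
        String sample = "Actualizaci\u00f3n lista. Otra actualizacio\u0301n.";
        System.out.println(count(sample, "actualizaci\u00f3n"));
    }
}
```

Running the same count over the text extracted by each version would show whether a reported difference is real or only an artifact of the comparison.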
So it's a 👍 from me
Tilman